This Yearly Technical Report covers research carried out on the Advanced Networking and
Distributed Systems Contract at UCLA under DARPA Contract Number MDA 972-91-J-1011, covering
the period from June 1, 1991 through May 30, 1992. Under this contract we have the following
statement of work comprising five tasks:


STATEMENT  OF  WORK 
Topic  A:  High  Speed  Networking 

Task A1: Fast Packet Switching Using Multistage Interconnection Networks

We propose to investigate the performance of a variety of Multistage Interconnection Networks
such as the Starlite network. We will develop analytical models to evaluate the throughput and
response time of the overall traffic in the case of uniform traffic as well as certain forms of hot
spot traffic. We will also evaluate the behavior of Message Combining to eliminate the effects of
hot spots. A transformation and superposition method is being developed to be used with the
analytical model to evaluate any given general traffic pattern (e.g., multiple hot spots). A delay
model analysis comparing the discarding switch and the blocking switch will also be developed.
We also propose to study a structured buffered pool scheme to prevent normal traffic from being
blocked by the saturated tree caused by hot spot traffic.

Task  A2:  Analysis  Of  Competing  Lightwave  Networks 

The use of Wavelength Division Multiple Access (WDMA) optical switching for high-speed
packet networks is a predictable development in the evolution of fast packet switching. We
propose to evaluate the behavior of single-hop WDMA optical switching, using agile receiver
filters. Whereas our main thrust will be on these single-hop structures, we will also look at
multi-hop access using fixed filters. We will compare the response time, blocking and
throughput for each.


The views and conclusions contained in this document are those of the authors and should not be
interpreted as necessarily representing the official policies, either expressed or implied, of the
Defense Advanced Research Projects Agency or the United States Government.

TOPIC B: ARCHITECTURE AND PARALLEL PROCESSING


Task B1: Performance Of Boolean n-Cube Interconnection Networks

We propose to evaluate the performance of Boolean n-cube interconnection networks for parallel
processing systems. The focus will be on data communication issues rather than on processing
issues. By exploiting the homogeneity property of Boolean n-cube interconnection networks,
we can design non-blocking routing algorithms with limited size buffers. A technique called
referral is used to guarantee that every node accepts all the messages transmitted from its
neighbors. This type of routing algorithm is critical in any implementation. Store-and-forward is one
such routing algorithm. In this scheme, time is divided into cycles to which the network is
synchronized. In each cycle every node simultaneously transmits some of its stored messages to its
neighbors. An analytical model will be developed to predict the network performance under
different traffic patterns. We also intend to design an intelligent routing algorithm to improve the
performance. Another routing scheme to consider is a modified version of virtual cut-through.
Virtual cut-through is a scheme such that when a message arrives at an intermediate node and its
selected outgoing channel is free, then the message is sent to the adjacent node before it is
completely received at this intermediate node. Therefore, the delay due to unnecessary buffering in
front of an idle channel is avoided. Modified virtual cut-through is also a non-blocking
algorithm. We will investigate the (positive or negative) effect of adding additional buffers to a node
in this case. We are further interested in non-uniform traffic problems in Boolean n-cube
networks.

We also propose to study the performance of these networks in a hostile and/or unreliable
environment. In this environment, nodes and links may disappear and unreliable (i.e., noisy)
transmissions may occur.

Task B2: Distributed Simulation

Parallel asynchronous simulation methods (such as Time Warp) offer an optimistic alternative to
synchronous conservative approaches to distributed simulation. We propose to evaluate the
speedup of P processors conducting a parallel asynchronous simulation using analytic and
simulation tools. We already have an exact solution for the case of two processors (P=2). Also, we
have upper bounds on the best one can do by letting the P processors run ahead of each other as
compared to forcing them to synchronize at every step. We are interested in extending the results
to P processors and to include the effect of queued messages. Furthermore, we propose to
investigate the use of the linear Poisson process as a model for these systems.

Task  B3:  A  New  Model  Of  Load  Sharing 

We are interested in studying the behavior of interacting processes which gobble up processing
resources in their neighborhood. In particular, if we begin with a one-dimensional world, we can
place processes on a ring, where there is a quantity of processing power distributed uniformly
around the ring. A process requires a changing amount of processing capacity. As its needs
increase, the process attempts to grow in both directions along the ring until it either has enough
capacity, or it bumps into another process moving in its direction, in which case they both stop
moving toward each other. As time progresses, a process may or may not have all the capacity it
needs. The object is to study the response time of jobs represented by such processes in a
limited-resource, competitive environment. Clearly, this model extends to higher dimensions, and we
propose to study the case where processors are distributed over a multi-dimensional hypersphere.
The effect of distributed load sharing in this environment will be evaluated.


Accomplishments  for  this  Period 

During this period we have graduated 2 Ph.D. students, have had 7 papers published, 7 papers
accepted and 5 papers submitted to the professional literature. Progress in this research effort is
moving along very nicely.

We have encountered no obstacles to our research and have in fact made excellent progress
toward our stated goals. In particular, with respect to fast-packet switching using multistage
interconnection networks, we have indeed carried out the analysis of finite and infinite buffered
networks with arbitrary traffic patterns and a variety of internal operating procedures. An example
of this work is contained in the attached report by Lin and Kleinrock entitled "Performance
Analysis of Finite Buffered Multistage Interconnection Networks with a General Traffic
Pattern".


In the area of lightwave networks, we have carried out the evaluation of wavelength division
multiplexing for local area networks and for passive star topologies. An example of this work is
contained in the attached publication by Lu and Kleinrock entitled "A Wavelength Division
Multiple Access Protocol for High Speed Local Area Networks with a Passive Star Topology".

In the area of Boolean n-cube interconnection networks, we have studied routing in such
networks under stable conditions with finite buffers (the typical case) and then further extended
these studies to deal with the case of unreliable nodes, thereby providing a fault tolerant routing
procedure. The paper by Horng and Kleinrock entitled "Fault Tolerant Routing with Regularity
Restoration in Boolean n-Cube Interconnection Networks" is included in the appendix of this
report as an example of that research.

In the area of distributed simulation we have carried out extensive analyses of the behavior of
Time Warp systems for the two-processor case; these are exact results and have led the way for
considerable research in this area. The enclosed paper by Felderman and Kleinrock, "Two
Processor Time Warp Analysis: Some Results on a Unifying Approach", summarizes some of our
results to date.

We have begun to look at some of the new issues involved in gigabit networks since they are
such an important part of the advanced networking arena these days. The paper by Kleinrock
entitled "The Latency/Bandwidth Tradeoff In Gigabit Networks" is a summary of our preliminary
thinking in this area.

Below we list our publications for this period. Progress continues in all these areas and we are
very much encouraged with the prospects for additional insight and results in a variety of these
systems. In addition we have launched a related study on the collective behavior of a large
number of mobile automata whose behavior is characteristic of advanced networking and
advanced distributed systems. Progress in this area will be reported in future technical reports.

Ph.D. DISSERTATIONS COMPLETED

1. Shen, Shioupyn, "The Virtual-Time Data-Parallel Machine", December 1991.

2. Horng, Ming-yun, "Analysis of Boolean n-Cube Interconnection Networks For
Multiprocessor Systems", March 1992.

PUBLISHED  PAPERS 

1. Lin, T.I. and L. Kleinrock, "Performance Analysis of Finite-Buffered Multistage
Interconnection Networks with a General Traffic Pattern," 1991 ACM Sigmetrics, Conference
on Measurement and Modeling of Computer Systems, May 21-24, 1991, San Diego, CA.

2. Horng, Ming-yun and L. Kleinrock, "On the Performance of a Deadlock-Free Routing
Algorithm for Boolean n-Cube Interconnection Networks with Finite Buffers," 1991
International Conference on Parallel Processing, pp. III-228-III-235, August 12-16,
1991, Pennsylvania State University.

3. Horng, Ming-yun and L. Kleinrock, "Fault Tolerant Routing With Regularity Restoration
in Boolean n-Cube Interconnection Networks," Proceedings of the Third IEEE Symposium
on Parallel and Distributed Processing, Dallas, Texas, December 2-5, 1991, pp. 458-465.

4. Felderman, R. and L. Kleinrock, "Two Processor Conservative Simulation Analysis,"
published in 1992 PADS Workshop, January 1992.

5. Lu, J. and L. Kleinrock, "An Access Protocol for High-Speed Optical LANs," ACM CSC
'92, Kansas City, MO, March 3-5, 1992, pp. 287-293.

6. Lu, J. and L. Kleinrock, "On The Performance Of Wavelength Division Multiple Access
Networks", ICC '92.

7. Kleinrock, L., "The Latency/Bandwidth Tradeoff In Gigabit Networks", IEEE
Communications Magazine, April 1992, Vol. 30, No. 4, pp. 36-40.

PAPERS ACCEPTED FOR PUBLICATION


1. Lu, J. and L. Kleinrock, "Performance Analysis of Single-Hop Wavelength Division
Multiple Access Networks," accepted to Journal of High-Speed Networks, 1992.

2. Kleinrock, L. and R. Felderman, "Two Processor Time Warp Analysis: A Unifying
Approach," accepted to International Journal in Computer Simulation.

3. Huang, J.H. and L. Kleinrock, "Performance Evaluation of Dynamic Sharing of
Processors in Two-Stage Parallel Processing Systems," accepted for publication in IEEE
Transactions on Parallel and Distributed Systems.

4. Lu, J. and L. Kleinrock, "A Wavelength Division Multiple Access Protocol for High
Speed Local Area Networks with a Passive Star Topology," accepted to Performance
Evaluation.

5.  Shen,  S.  and  L.  Kleinrock,  "The  Virtual  Time  Data-Parallel  Machine"  accepted  IEEE 

6. Felderman, R. and L. Kleinrock, "Two Processor Time Warp Analysis: Capturing the
Effects of Message Queueing and Rollback/State Saving Costs," accepted for publication
in the ACM Transactions on Modeling and Computer Simulation.

7. Felderman, R. and L. Kleinrock, "Bounds and Approximations for Self-Initiating
Distributed Simulation Without Lookahead," accepted to ACM Transactions on Modeling and
Computer Simulation, special issue on Distributed and Parallel Simulation Performance.

PAPERS  SUBMITTED  FOR  PUBLICATION 

1. Green, J. and L. Kleinrock, "Static Load Sharing of Processors in Broadcast Networks",
submitted for publication in the IEEE Transactions on Parallel and Distributed Systems.

2. Kleinrock, L. and W. Korfhage, "Collecting Unused Processing Capacity: An Analysis
of Transient Distributed Systems," submitted for publication in the IEEE Transactions on
Parallel and Distributed Systems.

3. Mehovic, Farid and L. Kleinrock, "An Approach to Modeling Optimistic Concurrency
Control," submitted for publication in the Communications of the ACM.

4. Powley, C., C. Ferguson and R. Korf, "Depth-First Heuristic Search on a SIMD
Machine," submitted to AI Journal.

5. Shen, S. and L. Kleinrock, "The Effect of Network Topology On the Performance of
Load Balancing in Distributed Systems," submitted to Performance Evaluation.

Fault-Tolerant Routing with Regularity Restoration
in Boolean n-Cube Interconnection Networks*


Ming-yun Horng and Leonard Kleinrock
Computer  Science  Department 
University  of  California,  Los  Angeles 
Los  Angeles,  CA  90024 


Abstract 

This paper proposes a set of techniques to restore the
regularity of a Boolean n-cube network in the presence
of node failures, and algorithms to effectively route
messages among the surviving nodes. An analytical
model to evaluate the degradation of a damaged
network is also presented.

One way to restore the regularity of a damaged
Boolean n-cube network is by simply disabling the
nodes with more than one bad neighbor. The remaining
network is called a "1-degraded subnet." A very
simple optimal-path routing algorithm, which requires
each node to know only its neighbors' status, is
developed for such a subnet. Since many nonfaulty nodes
may have to be disabled in constructing a 1-degraded
subnet, we further develop a heuristic algorithm to
restore the network's regularity by constructing a
"subnet connected with optimal paths (SCOP)," where only
a few nodes must be disabled. The routing algorithm
used in 1-degraded subnets also works for SCOPs. To
preserve the processing power of the network, we also
propose a two-level hierarchical fault-tolerant routing
scheme without disabling any nodes.


1  Introduction 

A major problem in designing a multiprocessor system
is to construct a reliable interconnection network
which provides efficient routing of messages among
processors. Recently, the Boolean n-cube network (also
known as the hypercube network) has become a widely
accepted interconnection architecture due to its
topological properties as discussed, for example, in [1].
Several research and commercial systems built on this type
of interconnection are now available [2].


*This work was supported by the Defense Advanced Research
Projects Agency under Contract MDA 903-87-C0663, Parallel
Systems Laboratory.


The success of the simple routing algorithms [3] used
in Boolean n-cube networks is based on the networks'
regularity properties. Although an interconnection
network is usually operated in a well-protected
environment, faults may occur. When some nodes or
communication links fail, the regularity of this "damaged"
network is destroyed and the routing algorithm may
no longer be applicable. To build a reliable
multiprocessor system, the presence of fault-tolerant routing to
ensure successful communications between any pair of
nonfaulty nodes is essential. Moreover, since the
channel speed of an interconnection network is very high,
the amount of time that a node can afford to spend in
making routing decisions is severely constrained. It is
important to have the routing algorithm as simple as
possible and hardware realizable.

To successfully route messages in a damaged
Boolean n-cube network, either the surviving nodes
or the messages must be equipped with information
about the locations of the faults. Several algorithms
requiring each node of the network to know only the
status of its local components (links and nodes) have
been presented in [4, 5]. However, the limitation of
these approaches is that either the total number of
faulty components is very restricted (e.g., less than
n) or the number of hops traversed by a message may
grow without bound. These problems can be solved by
providing each node with more information and having
it compute a "safe" route for each message. Algorithms
that require each node to know the global status of the
network have been reported in [6]. One can even
assume the surviving part of the network has an arbitrary
topology, in which case each node maintains a routing
table as used in networks such as the ARPANET [7, 8].
However, as the network grows in size, the amount of
storage space and time needed to maintain and update
these routing tables becomes prohibitive [9].

Since a number of faults in a richly-connected
Boolean n-cube network may not destroy its entire
regularity, the routing algorithm may take advantage of
the remaining topological regularity. Chen and Shin
[10] developed and analyzed a set of fault-tolerant
routing algorithms based on the depth-first search
principle. In their algorithms, each message contains a tag
to keep track of the path traveled so far to avoid
visiting a node more than once. To tolerate more than
n - 1 faults, a more complicated procedure is required
to guide backtracking whenever a message reaches a
dead end. Thus, the length of the packets is variable
and the computation overhead is not trivial. To further
guarantee that every message is routed to its
destination via a shortest path, every node must be equipped
with nonlocal status [11].

In this paper, we begin with a queueing model to
evaluate the degradation of a damaged Boolean n-cube
network with node failures. We then develop a set of
techniques to restore the regularity of the network, and
algorithms to effectively route messages among the
surviving nodes. Our algorithms work under any number
of faults as long as the network remains connected.

We restore the regularity of a damaged Boolean
n-cube network by disabling the nodes with more than
one bad neighbor. The remaining network is called a
"1-degraded subnet." We then develop a very simple
routing algorithm for such a 1-degraded subnet. With
this algorithm, each node only needs to know the status
of its neighbors, and every message is routed to its
destination via an "optimal path" (to be defined).

Though the 1-degraded subnet can easily be
constructed in a distributed manner, many nonfaulty
nodes may have to be disabled. We develop a heuristic
algorithm to construct a subnet in which every pair of
surviving nodes is connected by at least one "optimal"
path. In this paper, such a subnet is called a
SCOP (Subnet Connected with Optimal Paths). We
show that only a small number of nonfaulty nodes will
be disabled. The optimal-path routing algorithm for a
1-degraded subnet also works in a SCOP. Some
simulation results are presented and compared with a lower
bound we develop.

To fully preserve the processing power of the
network, we further develop a two-level hierarchical
fault-tolerant routing scheme without disabling any
nonfaulty nodes. With this approach, a non-optimally
connected network is decomposed into a set of clusters
such that every cluster forms a subcube with the same
property as a SCOP. Each node maintains a small
routing table in which every entry of the table corresponds
to a destination cluster. Messages are first routed to
their destination clusters by use of these routing tables.
After a message has arrived at its destination cluster,
it is then routed to its destination via an optimal path.


Figure 1: A Boolean 4-cube network with node faults.

2 Preliminaries

A Boolean n-cube network consists of $2^n$ nodes, each
addressed by an n-bit binary number from 0 to $2^n - 1$.
(See Figure 1, where faulty nodes are drawn as black
dots.) Nodes are interconnected in such a way that
there is a bidirectional link between two nodes, say i
and j, if and only if their addresses differ in exactly one
bit position k (so that $|i - j| = 2^k$), where k is an integer
from 0 to n - 1; in this case, we say that these two nodes
are linked together in dimension k. For example, in
a Boolean 4-cube network, nodes 1000 and 1010 are
linked together in dimension 1. It can be seen that by
removing all the links in any particular dimension, a
Boolean n-cube network is separated into two
(n - 1)-cube networks.
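
The link rule can be made concrete with a few lines of code. The following is a minimal sketch in Python (chosen here only for illustration; the function names are hypothetical and do not come from the paper). It treats node addresses as integers and assumes two nodes are linked exactly when their addresses differ in one bit:

```python
def neighbors(node, n):
    """Return the n neighbors of `node` in a Boolean n-cube, ordered by dimension."""
    # Flipping bit k of the address gives the neighbor in dimension k.
    return [node ^ (1 << k) for k in range(n)]


def link_dimension(i, j):
    """Return the dimension linking nodes i and j, or None if they are not adjacent."""
    diff = i ^ j
    # Adjacent nodes differ in exactly one bit, so diff must be a power of two.
    if diff == 0 or diff & (diff - 1):
        return None
    return diff.bit_length() - 1
```

For the example in the text, `link_dimension(0b1000, 0b1010)` evaluates to 1, while nodes 0001 and 0010 are not linked even though their addresses differ numerically by 1.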

Every "subcube" in a Boolean n-cube network can
be uniquely addressed by a string of n symbols drawn
from the set {0, 1, X}, where X is a don't care symbol
[10]. For example, in a Boolean 4-cube network, nodes
0001, 0011, 0101 and 0111 form a subcube addressed
by 0XX1. A node is itself a subcube.
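
As an illustration of this addressing convention, the sketch below expands a subcube address into the node addresses it covers. The helper name `subcube_nodes` and the use of strings for addresses are assumptions made for this example only:

```python
from itertools import product


def subcube_nodes(address):
    """Expand a subcube address over {0, 1, X} into the list of node addresses."""
    # Each X may be either 0 or 1; fixed symbols contribute a single choice.
    choices = [("0", "1") if symbol == "X" else (symbol,) for symbol in address]
    return ["".join(bits) for bits in product(*choices)]
```

Calling `subcube_nodes("0XX1")` lists the four nodes named in the text, and an address without any X yields the node itself.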

The Hamming distance between any two nodes is
defined as the number of bits which differ between their
addresses. The length of a path from one node to
another is defined as the number of links on the path.
An "optimal path" between two nodes is a path whose
length is equal to their Hamming distance. A node
might not be able to communicate with another via
an optimal path in a damaged network. For example,
in Figure 1, node 0110 cannot communicate with
node 0101 via an optimal path. However, they are able
to communicate with each other via the path through
nodes 1110, 1100 and 1101. This path is called a
shortest path since, among all the possible remaining paths
between these two nodes, its length is minimal.
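
The distinction between an optimal path and a shortest path can be checked mechanically. The sketch below is not part of the paper: it computes the Hamming distance and uses a breadth-first search over surviving nodes to find the shortest remaining path length. Because Figure 1 is not reproduced here, the fault set `FAULTY` is a hypothetical one, chosen only so that both optimal paths between 0110 and 0101 are blocked.

```python
from collections import deque


def hamming_distance(a, b):
    """Number of bit positions in which addresses a and b differ."""
    return bin(a ^ b).count("1")


def shortest_path_length(src, dst, n, faulty):
    """Breadth-first search over nonfaulty nodes; returns hops, or None if unreachable."""
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, hops = queue.popleft()
        if node == dst:
            return hops
        for k in range(n):
            nxt = node ^ (1 << k)
            if nxt not in faulty and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, hops + 1))
    return None


# Hypothetical fault set (an assumption, not taken from Figure 1).
FAULTY = {0b0100, 0b0111}
```

With this fault set, the Hamming distance between 0110 and 0101 is 2, but the shortest surviving path has 4 links, the same length as the path through 1110, 1100 and 1101 described above.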

Routing algorithms for Boolean n-cube networks
can be found, for example, in [3]. Let the header
(address portion) of a newly generated message be the
exclusive-OR of the message's source and destination
addresses. Every one-bit in the header corresponds to a
valid dimension over which the message can be sent one
hop closer to its destination. When a message is sent
over a valid dimension, the corresponding one-bit is
changed to zero. Here, the selection of a possible valid
dimension for transmission can be adaptive to traffic.
A message reaches its destination when its header
contains only zeroes. It is clear that, with this algorithm,
all messages are routed to their destinations via their
optimal paths. However, this routing algorithm cannot
work for such a damaged network as shown in Figure
1 since all the optimal paths between nodes 0110 and
0001 are blocked.
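
The header-based routing just described can be sketched as follows for a fault-free cube. This is an illustrative sketch, not the implementation from [3]: the function name `route_fault_free` is hypothetical, and a uniformly random choice stands in for whatever traffic-adaptive selection rule a real system would use.

```python
import random


def route_fault_free(src, dst, n, rng=random):
    """Route a message in a fault-free n-cube; returns the list of nodes visited."""
    header = src ^ dst          # one-bits mark the dimensions still to be crossed
    node = src
    path = [node]
    while header != 0:
        valid = [k for k in range(n) if (header >> k) & 1]
        k = rng.choice(valid)   # assumption: random choice instead of traffic-adaptive
        header &= ~(1 << k)     # clear the one-bit for the dimension just used
        node ^= 1 << k          # move to the neighbor in dimension k
        path.append(node)
    return path
```

Every hop clears exactly one header bit, so the number of hops always equals the Hamming distance between source and destination.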

In  this  paper,  we  make  the  following  assumptions: 

•  The  remaining  network  is  connected. 

•  Since  nodes  (or  processors)  are  more  complex  than 
links  and  therefore  have  higher  failure  rates,  we 
assume  only  node  failures.  To  consider  a  link  fail¬ 
ure  between  two  nonfaulty  nodes,  one  may  disable 
one  of  these  two  nodes.  In  [12],  our  algorithms  are 
extended  to  consider  link  failures. 

• Each node knows the status of its neighboring
nodes.

•  A  node  cannot  transmit  a  message  to  a  faulty 
neighboring  node. 

•  Messages  are  only  destined  for  nonfaulty  nodes. 


3  Degradation  of  Networks  with  Node 
Faults 

In this section, we present a simple queueing model
to evaluate how much a network is degraded by node
faults. In many cases it is likely that the failure rate of
multiprocessor systems is very small. Let each node fail
independently with probability p. We also assume that
the arrival of input messages to each node follows a
Poisson process with a rate of $\lambda$ messages per unit time;
message lengths are random and drawn independently
from an exponential distribution. Since a link survives
if and only if both nodes at its ends survive, we have

$$\text{Prob}[\text{a link survives}] = (1 - p)^2.$$

Figure 2 shows a Boolean n-cube network with a
"cut" in its third dimension. The expected number of
surviving links crossing any dimension is given by

$$2^{n-1}(1 - p)^2. \tag{1}$$

We further assume that messages are uniformly destined
to all other surviving nodes in the network. Thus,

$$\alpha \triangleq \text{traffic intensity from a given source to a particular destination} = \frac{\lambda}{2^n(1 - p) - 1}. \tag{2}$$


Figure 2: A cut in dimension 3. (Surviving links crossing
the cut are shown as heavy lines.)


From Figure 2 we note that every message which
is generated from a node in the left subcube and is
destined to a node in the right subcube must travel over
one of the surviving links in order to cross the "cut".
If a message travels along an optimal path between
these two nodes, the message must travel across the
"cut" exactly once. We further assume that the traffic
crossing this "cut" is perfectly balanced. The average
number of surviving nodes in each subcube is $2^{n-1}(1 - p)$.
Each will send $\alpha$ units of traffic to every other
surviving node in the network. Thus each node will
send $\alpha[2^{n-1}(1 - p)]$ units of traffic across each cut. Since
there are $2^{n-1}(1 - p)$ nodes in each subcube doing this,
and traffic is balanced on each link, we have

$$\rho \triangleq \text{traffic load per channel} = \frac{[2^{n-1}(1 - p)]^2 \alpha}{2^{n-1}(1 - p)^2} \tag{3}$$

$$= \frac{\lambda}{2(1 - p) - 2^{1-n}}. \tag{4}$$

Here, the channel is unidirectional. Obviously, this
model yields an optimistic bound.

We apply Kleinrock's Independence Assumption [8]
which is often used in the delay analysis of communication
networks. This assumption states that each time
a message is received at a node within the network, its
transmission time is chosen independently from an
exponential distribution. We assume the mean transmission
time of a message equals one unit of time. Thus,
each channel is modeled as an M/M/1 system with
Poisson arrivals at a rate $\lambda/[2(1 - p) - 2^{1-n}]$ and with
an exponential service time whose mean is one unit of
time. In order for this system to be stable, we require
that $\rho < 1$, that is,

$$\lambda < 2(1 - p) - 2^{1-n}. \tag{5}$$
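
The quantities in equations (2) through (5) are easy to evaluate numerically. The sketch below is an illustration only; the function names are hypothetical, `lam` stands for the per-node arrival rate, and `p` for the node failure probability. It computes the channel load both from the balance argument (nodes per subcube, surviving links per cut) and in its simplified closed form.

```python
def pairwise_intensity(lam, n, p):
    """alpha of equation (2): traffic from one source to one particular destination."""
    return lam / (2 ** n * (1 - p) - 1)


def channel_load(lam, n, p):
    """rho of equation (3): traffic crossing a cut divided by the surviving links."""
    nodes_per_subcube = 2 ** (n - 1) * (1 - p)
    links_per_cut = 2 ** (n - 1) * (1 - p) ** 2
    return nodes_per_subcube ** 2 * pairwise_intensity(lam, n, p) / links_per_cut


def channel_load_closed_form(lam, n, p):
    """rho of equation (4), after simplification."""
    return lam / (2 * (1 - p) - 2 ** (1 - n))


def max_stable_rate(n, p):
    """Right-hand side of the stability condition (5)."""
    return 2 * (1 - p) - 2 ** (1 - n)
```

For instance, with n = 4 and p = 0.1 the stability limit is 1.675 messages per unit time per node, and at that rate the channel load reaches 1.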


We define

$$\gamma \triangleq \text{throughput of the network};$$

then we have

$$\gamma = \lambda 2^n (1 - p) \tag{6}$$

$$< 2[2^n(1 - p) - 1](1 - p), \tag{7}$$

where $2[2^n(1 - p) - 1](1 - p)$ is clearly the mean network
communication capacity. The mean message delay is
then given by [8]

$$T = \frac{M}{\gamma} \cdot \frac{\rho}{1 - \rho} = \frac{M}{\lambda 2^n (1 - p)} \cdot \frac{\rho}{1 - \rho},$$

where M is the number of surviving channels in the
network ($M = 2^{n-1}(1 - p)^2 \cdot 2n$). Thus,

$$T = \frac{n(1 - p)}{2(1 - p) - 2^{1-n} - \lambda}. \tag{8}$$
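
The step from the per-channel M/M/1 delay to the closed form (8) can be verified numerically. The sketch below is an illustration with hypothetical function names. It assumes the delay is obtained by weighting the M/M/1 delay of each of the M identical channels by its share of the network throughput, and compares that with the closed form.

```python
def mean_delay(lam, n, p):
    """Closed-form mean message delay of equation (8)."""
    slack = 2 * (1 - p) - 2 ** (1 - n) - lam
    if slack <= 0:
        raise ValueError("arrival rate violates the stability condition (5)")
    return n * (1 - p) / slack


def mean_delay_from_channels(lam, n, p):
    """Same delay built from M identical M/M/1 channels with unit mean service time."""
    rho = lam / (2 * (1 - p) - 2 ** (1 - n))      # load per channel, equation (4)
    gamma = lam * 2 ** n * (1 - p)                # network throughput, equation (6)
    m = 2 ** (n - 1) * (1 - p) ** 2 * 2 * n       # surviving unidirectional channels
    return (m / gamma) * rho / (1 - rho)
```

As a sanity check, with p = 0 and a vanishing arrival rate the formula reduces to n·2^(n-1)/(2^n - 1), which is the mean Hamming distance between two distinct nodes of an n-cube.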


We  further  obtain  the  following  approximations  for 
n  >>  1. 


(9) 


$$\gamma < 2^{n+1}(1 - p)^2, \tag{10}$$


and


$$T \approx \frac{n(1 - p)}{2(1 - p) - \lambda}. \tag{11}$$


In Figure 3 we show the mean message delay
obtained from our optimistic model for a Boolean 4-cube
network with two different failure rates. We also ran a
flow deviation program [14] to find the minimal
achievable delay. We find that our assumptions are
appropriate if failure rates are small.

Moreover, in most queueing systems, two performance
measures, response time and throughput, compete
with each other. Typically, by raising the
throughput of the system, which is desirable, the mean
response time is also raised, which is undesirable. Here,
we combine the throughput and the mean message
delay of the network into a single measure, power, which
is defined as follows [13]:

$$\text{Power} \triangleq \frac{\text{Throughput of the network}}{\text{Mean message delay}}.$$

A system is said to be operating at an optimal point
if the power at that point is maximized. For $n \gg 1$,
we find that power is maximized when $\lambda \approx 1 - p$, which
is equal to half the maximum allowed throughput per
node, as found in [13].
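
The optimal operating point quoted above can be checked with a short derivation. The following is a sketch that uses only equation (6) and the large-n approximation (11); the symbol $P$ for power is introduced here for convenience and is an assumption, not the paper's notation.

```latex
\begin{align*}
P(\lambda) &= \frac{\gamma}{T}
  \approx \frac{\lambda\, 2^n (1-p)\,\bigl[2(1-p) - \lambda\bigr]}{n(1-p)}
  = \frac{2^n}{n}\,\lambda\,\bigl[2(1-p) - \lambda\bigr], \\
\frac{dP}{d\lambda} &= \frac{2^n}{n}\,\bigl[2(1-p) - 2\lambda\bigr] = 0
  \quad\Longrightarrow\quad \lambda = 1 - p .
\end{align*}
```

Since $P$ is a downward-opening parabola in $\lambda$, this stationary point is a maximum. For large n the stability condition (5) allows a rate of about $2(1-p)$ per node, so the maximizing rate is half of that limit.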


Figure 3: Minimal achievable delay of a Boolean 4-cube
network with two different failure rates.

4  Routing  in  1-Degraded  Subnets 

A network is said to be k-degraded if every surviving node in the network has at most k "bad" (to be discussed) neighbors. A damaged network can easily be made k-degraded in a distributed manner as follows: Every surviving node (or nonfaulty node initially) has a list which gives the status of its neighboring nodes. Every surviving node checks its list and disables itself if it has more than k "bad" neighbors; in this case, it must inform all its surviving neighbors of the change in its status. Every surviving node keeps updating its list until it disables itself or the disabling process stops. Here, during each step of the iteration, the "bad" nodes include all faulty nodes and all nodes which have been disabled in previous iterations. We call the remaining network a "k-degraded subnet."
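
The disabling process can be sketched as a synchronous iteration. This is an illustrative, centralized simulation of the distributed procedure, assuming nodes are numbered 0 to 2^n - 1 and that two nodes are neighbors when their addresses differ in exactly one bit. The function name `k_degraded_subnet` is hypothetical.

```python
def k_degraded_subnet(n, faulty, k=1):
    """Return the set of nodes that survive the iterative disabling process."""
    bad = set(faulty)
    surviving = set(range(2 ** n)) - bad
    while True:
        # One round: every surviving node inspects its current neighbor list.
        to_disable = {
            node for node in surviving
            if sum((node ^ (1 << d)) in bad for d in range(n)) > k
        }
        if not to_disable:
            return surviving          # the disabling process has stopped
        surviving -= to_disable
        bad |= to_disable             # disabled nodes count as bad next round
```

For example, in a 3-cube with faulty nodes 001 and 010, nodes 000 and 011 each see two bad neighbors and disable themselves, leaving the subcube 1XX.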

We now present a very simple adaptive routing algorithm for 1-degraded Boolean n-cube subnets. This algorithm requires each node to know only its neighbors' status. We let neighbor_status be an n-bit binary number in which a bit is set to one if its corresponding neighbor is surviving. Otherwise, the bit is reset to zero. This routing algorithm is shown in Figure 4, where "&" is a bit-wise AND function. We note that, with this algorithm, every message is routed to its destination along an optimal path.

The proof that this routing algorithm works for 1-degraded Boolean n-cube subnets is as follows: If a node receives a message with more than one one-bit in its header, the node surely can find a valid dimension (or channel) to transmit the message, since at most one of its neighbors is bad. If the message has only a single one-bit in its header, then the corresponding neighbor must be surviving. Otherwise, the assumption that messages are only destined for surviving nodes is violated.

When a message is received:

    If (header = 0)
        Send the message to the local processor.
    else
        valid_channels <- header & neighbor_status.
        Randomly select a 1-bit from valid_channels.
        Change the selected 1-bit in the header to 0.
        Send the message over the selected channel.

Figure  4:  The  optimal-path  routing  algorithm  for  1- 
degraded  subnets 
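
The procedure of Figure 4 can be sketched in Python as follows. This is an illustrative sketch, assuming that the header holds the bit-wise XOR of the current node address and the destination address (so each one-bit marks a dimension still to be crossed) and that bit d of neighbor_status describes the neighbor across dimension d. The helper `route` and its arguments are hypothetical.

```python
import random


def route_step(header, neighbor_status, rng=random):
    """One routing decision: (dimension, new_header), or (None, 0) for local delivery."""
    if header == 0:
        return None, 0                      # message has arrived
    valid_channels = header & neighbor_status
    if valid_channels == 0:
        raise RuntimeError("no valid channel: subnet is not optimally connected")
    dims = [d for d in range(valid_channels.bit_length())
            if (valid_channels >> d) & 1]
    d = rng.choice(dims)                    # randomly select a 1-bit
    return d, header & ~(1 << d)            # clear the selected bit


def route(src, dst, n, surviving, rng=random):
    """Follow one message from src to dst and return the list of visited nodes."""
    node, header, path = src, src ^ dst, [src]
    while True:
        status = sum(1 << d for d in range(n)
                     if (node ^ (1 << d)) in surviving)
        d, header = route_step(header, status, rng)
        if d is None:
            return path
        node ^= 1 << d
        path.append(node)
```

Because each hop clears exactly one header bit, any path returned by `route` has as many hops as the Hamming distance between source and destination.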


Figure  5:  Percentage  of  surviving  nodes  in  the  1- 
degraded  subnets. 

This approach works well in the situation where a whole cluster of nodes has been "bombed out." As an example of such spatially correlated faults, one may consider a power-supply failure which disables the entire cluster of nodes supported by it.

The disadvantage of this approach for 1-degraded subnets is that, in a large Boolean n-cube network with moderate failure rates, since every node has a large number of neighbors, the probability that a node has more than one bad neighbor can be large. As a result, many nodes may have to be disabled; hence, the computational power of the system is significantly reduced. Figure 5 shows, for networks of different sizes, the percentage of nodes that remain after the disabling iteration settles down, given that nodes initially fail independently with a given probability.


5  Construction  of  a  SCOP 

In this section, we develop a heuristic algorithm to construct a subnet where every pair of surviving nodes is connected with at least one optimal path (or we say every pair of nodes is optimally connected). We call such a subnet a SCOP (Subnet Connected with Optimal Paths). We show that our routing algorithm for 1-degraded subnets works in such a subnet.

In a Boolean n-cube network, a node and its k neighbors can uniquely identify a Boolean k-subcube, where 0 < k < n. We note that such a subcube is the smallest subcube containing the node and its k neighbors. For example, in a Boolean 4-cube network, the node 0110 and nodes 0100 and 0111 can identify the subcube 01XX. Moreover, any two nodes which are k hops away in distance can also identify a Boolean k-subcube.
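
The smallest subcube containing a set of nodes can be found by marking every bit position in which the nodes disagree. The sketch below is illustrative; the function name and the string representation with X for a free dimension are assumptions chosen to match the notation used in the text.

```python
def smallest_subcube(nodes, n):
    """Return the smallest subcube containing all nodes, e.g. '01XX'."""
    nodes = list(nodes)
    base = nodes[0]
    free = 0
    for node in nodes[1:]:
        free |= base ^ node              # dimensions in which the nodes differ
    return "".join(
        "X" if (free >> d) & 1 else str((base >> d) & 1)
        for d in reversed(range(n))      # most significant bit first
    )
```

With node 0110 and its neighbors 0100 and 0111 the function returns 01XX, the subcube named in the example above.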

We assume there is a central control unit that collects information from every surviving node of the network and makes decisions about how to disable a node. Here is a heuristic algorithm for constructing a SCOP: We let List{i} be a check-list which contains all surviving nodes with i bad neighbors. Again, the "bad" nodes include all faulty nodes and all nodes having been previously disabled. Nodes on List{i} have higher priority for disablement than nodes on any list with smaller i. That is, the node with the most bad neighbors (worst connection) has the highest priority of being disabled.

We choose a node, say node j, from the highest priority non-empty check-list, and simply find the smallest subcube containing node j and all its bad neighbors (e.g., in Figure 1, node 0110 and subcube 0XXX). It is clear that without routing through the links outside the subcube, node j cannot communicate with any other surviving nodes of the subcube. If the total number of surviving nodes in the subcube is more than 2, node j is disabled. If the number of surviving nodes in the subcube is exactly 2, we choose to disable either one of them (see [12]). Otherwise, node j is safe at this moment and is removed from the lists. A safe node may be brought back to the check-lists if any of its neighbors is disabled. The algorithm stops when every surviving node is safe. Figure 6 illustrates a resulting SCOP for the sample Boolean 4-cube network as shown in Figure 1. In [12], we show the number of surviving nodes of the SCOPs obtained from our heuristic algorithm is very close to the number of surviving nodes achievable by an exhaustive search.
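
The heuristic can be sketched as follows. This is a simplified, illustrative version and not the authors' implementation: the check-lists are replaced by a direct search for the unsafe node with the most bad neighbors, ties are broken by node number, and when exactly two surviving nodes share the subcube the sketch always disables node j (the paper allows either choice). The function name `build_scop` is hypothetical.

```python
def build_scop(n, faulty):
    """Return the set of surviving nodes after the disabling heuristic."""
    bad = set(faulty)
    surviving = set(range(2 ** n)) - bad
    safe = set()

    def bad_dims(node):
        return [d for d in range(n) if (node ^ (1 << d)) in bad]

    while True:
        candidates = surviving - safe
        if not candidates:
            return surviving                  # every surviving node is safe
        # Highest priority: the node with the most bad neighbors.
        j = max(candidates, key=lambda v: (len(bad_dims(v)), -v))
        free = sum(1 << d for d in bad_dims(j))
        # Surviving nodes of the smallest subcube holding j and its bad neighbors.
        members = [v for v in surviving if ((v ^ j) & ~free) == 0]
        if len(members) >= 2:
            # More than two: disable j. Exactly two: either may go; j is chosen.
            surviving.discard(j)
            bad.add(j)
            for d in range(n):                # neighbors return to the check-lists
                safe.discard(j ^ (1 << d))
        else:
            safe.add(j)                       # j is alone in its subcube
```

A call such as `build_scop(4, {0b0010, 0b0011, 0b0100, 0b0111, 0b1001})` uses a fault set inferred from Table 1 of this paper; because of the tie-breaking assumptions, the result need not coincide with the SCOP drawn in Figure 6.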

We now prove that the optimal-path routing algorithm we developed for 1-degraded subnets works for a SCOP. If the destination node of a message is k hops away from the node where the message currently resides, the current node would ordinarily have k valid dimensions to choose from. Since the k-subcube identified by the current node and the destination node is optimally connected, the current node must be able to find at least one valid dimension along which to transmit the message. Thus, if every subcube is optimally connected, then the optimal-path routing algorithm works.

Figure 6: A SCOP of the Boolean 4-cube as shown in Figure 1. (Legend: nonfaulty, faulty, and disabled nodes.)

Figure 7: Percentage of nodes remaining in the SCOPs.

Figure 7 shows the percentage of nodes which remain in the SCOP, given that nodes initially fail independently with a given probability. Comparing this with the percentage of nodes which remain in the 1-degraded subnet, we find the number of surviving nodes is dramatically increased. In Figure 8, we compare these two disabling schemes by showing the percentage of surviving nodes in a Boolean 8-cube network.

It is very difficult to analytically evaluate the performance of arbitrary networks in a dynamic traffic environment. In this paper, routing in the SCOPs is extensively simulated. To verify the effectiveness of our routing algorithm, for each failure pattern, we also ran a flow deviation program to find the minimal achievable delay. These results are also compared with the optimistic bound obtained in Section 3. Figure 9 shows, for different input rates, the mean message delay in the SCOPs of a Boolean 6-cube network. The results with 96% confidence shown here were from 100 randomly generated patterns, each containing 6 faulty nodes. We find that the mean message delay is very close to the minimal achievable bound, which is also very close to the optimistic bound.

Figure 8: Comparison of percentage of surviving nodes, n = 8.

Figure 9: Mean delay of the SCOPs of Boolean 6-cube networks. Each network has 6 faulty nodes.


6  Two-level  Hierarchical  Routing 


In this section, without disabling any nonfaulty nodes, we retain the regularity of a damaged Boolean n-cube network by decomposing the network into a set of clusters such that every cluster forms a subcube with the same property as a SCOP (i.e., every pair of the surviving nodes of a cluster is connected with at least one optimal path). A two-level hierarchical routing algorithm, which requires every node to maintain a small routing table, is then developed.

    node    bad dimensions
    0000    1 2
    0001    1 3
    0101    0 1
    0110    0 1 2
    1011    1 3

Table 1: Bad dimensions for each surviving node with more than one bad neighbor.


6.1  The  Two-level  Network  Decomposi¬ 
tion 

A non-optimally connected Boolean n-cube network is decomposed into a set of clusters as follows. For all the surviving nodes which have more than one bad neighbor, the central control unit counts the number of bad links in each dimension. The central control unit then chooses the dimension having the most bad links and cuts the network into two clusters along this dimension. Each of these two clusters is an (n - 1)-subcube. A subcube must be further decomposed if not every pair of surviving nodes of the subcube is optimally connected.

Again, as an example, let us examine the Boolean 4-cube network with 5 faulty nodes as shown in Figure 1. Clearly the network is not optimally connected. In Table 1 we show, for each surviving node with more than one bad neighbor, the dimensions along which its neighbors are bad. In this example, dimension 1 has the maximum number of bad links (i.e., 5). We decompose the network along dimension 1. As a result, the network is separated into the following subcubes: XX0X and XX1X. In this case, both of these two subcubes are optimally connected. The two-level hierarchical structure is shown in Figure 10.
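
The choice of the cutting dimension can be sketched as below. The five faulty nodes are not listed in the text; the set used in the usage note is inferred from the bad dimensions in Table 1 and should be read as an assumption. Dimension 0 is taken to be the least significant address bit, and the function names are illustrative.

```python
def bad_link_counts(n, faulty):
    """Count bad links per dimension over nodes with more than one bad neighbor."""
    bad = set(faulty)
    counts = [0] * n
    for node in range(2 ** n):
        if node in bad:
            continue
        dims = [d for d in range(n) if (node ^ (1 << d)) in bad]
        if len(dims) > 1:                 # only nodes with more than one bad neighbor
            for d in dims:
                counts[d] += 1
    return counts


def split_dimension(n, faulty):
    """Dimension with the most bad links; ties go to the lowest dimension."""
    counts = bad_link_counts(n, faulty)
    return max(range(n), key=lambda d: counts[d])
```

With the inferred fault set {0010, 0011, 0100, 0111, 1001}, `bad_link_counts` reproduces the rows of Table 1 and `split_dimension` selects dimension 1, which yields the clusters XX0X and XX1X.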


6.2  Routing  in  the  Two-level  Hierarchical 
Network 

Messages must first be routed to their destination clusters. Every surviving node maintains a cluster routing table with one entry for each destination cluster. Each entry gives the address of a destination cluster, the best outgoing channel for that cluster, and a relative weight (usually delay is used as the weight). Any algorithm (e.g., the ARPANET-like algorithm) can be used to maintain the cluster routing table.

Figure 10: A two-level hierarchical structure of a 4-cube network with 5 node faults as shown in Figure 1.

When a node receives a transit message, if the destination node of the message does not belong to the cluster in which the node resides, the message is sent to a neighbor based on the node's cluster routing table. Otherwise, the message is routed to its destination node using the routing algorithm for 1-degraded subnets. To exploit the possible multiple paths from one node to another in a cluster and balance the network's traffic, the message is sent along a most lightly loaded channel in its destination cluster. We may further improve the performance by providing multipath routing, where each entry of the routing table gives multiple choices of outgoing channels.

6.3  Discussion 

This hierarchical routing approach has the following advantages:

•  No good links or nodes are eliminated. The processing power of the nonfaulty part of the network is fully maintained.

•  The size of the routing table is significantly reduced from the ARPANET-like routing table. In [12], we show that the number of clusters generated by decomposition cannot exceed the number of faulty nodes in the network.

•  The optimal-path routing algorithm for 1-degraded subnets works for each cluster.

•  The number of hops traversed by a message is bounded.

•  The increase in the mean path length caused by hierarchical routing is typically very small [12].

7  Conclusions

In this paper, we first developed a queueing model to evaluate the degradation of a damaged Boolean n-cube network with node faults. We next developed an adaptive fault-tolerant routing algorithm for 1-degraded subnets. This algorithm is very simple; it makes routing decisions based only on the node's local status. This algorithm routes every message to its destination via an optimal path.

We further exploited the remaining regularity of a damaged Boolean n-cube network and developed a heuristic algorithm to construct a subnet (i.e., SCOP) in which every pair of surviving nodes is connected with at least one optimal path. We showed the algorithm used in a 1-degraded subnet also works for a SCOP. Only a small number of nonfaulty nodes must be disabled. The performance of the optimal-path routing algorithm in SCOPs was studied. We found the mean message delay is very close to the minimal achievable bound.

To preserve all the nonfaulty nodes in the network, we also developed a two-level hierarchical routing scheme. A damaged Boolean n-cube network is decomposed into a set of clusters; each of them is a SCOP. Every surviving node in the network is required to maintain a small routing table. A two-level hierarchical routing algorithm has also been developed. This approach maintains the network's rich connection without disabling a single nonfaulty node. In [12], we show that the probability of routing a message to its destination via a shortest path is very high and that the increase in the mean path length caused by hierarchical routing is very small. More simulation results are being collected and some other performance measures such as the mean delay, the throughput of the network and hot spot problems are also being evaluated.

Abstract

We  present  some  results  from  an  exact  anal¬ 
ysis  of  a  new  model  for  the  problem  of 
two  processors  running  the  Time  Warp  dis¬ 
tributed  simulation  protocol.  The  model 
creates  a  unifying  framework  for  previous 
work  in  this  area  and  additionally  provides 
some  clear  insight  into  the  operation  of  sys¬ 
tems  synchronized  by  rollback. 

1  Introduction 

Distributed simulation has proven to be an important application for parallel processing. Accordingly, several algorithms have been developed to perform simulation with multiple processors. The most well known techniques are generally classified into two types: conservative [8] and optimistic [2]. Generally speaking, optimistic simulation on multiple processors is a technique which allows each processor to proceed with its portion of a simulation independent of the other processors (optimistically assuming that the others will not interact with that processor); if, at a later time, it finds that some other processor caused its earlier assumption to be false, it will roll back and proceed forward again. Our research focuses on the analysis of the average case behavior of Time Warp, the most well known optimistic technique.

Very little work has appeared in the literature which discusses average case behavior of Time Warp (TW). Lavenberg et al. [5] and Mitra and Mitrani [9] have examined models similar to ours, and we will address their relationship to this work in Section 5. Recently, Lin and Lazowska [6] have examined Time Warp and conservative methods by appealing to critical path analysis. Though their work provides important insights, it generates different types of results than ours. Finally, Madisetti [7] provides bounds on the performance of a two processor system where the processors have different speeds of processing and move at constant rates. Madisetti extends his model to multiple processors, something we do not address in this work.

This work was supported by the Defense Advanced Research Projects Agency under Contract MDA 903-87-C0663, Parallel Systems Laboratory.

The next section introduces our model for Time Warp. Section 3 provides its exact solution while in Section 4 we derive some performance measures. In Section 5 we examine the model as we take limits on various parameters and discuss the relationship of this work to that of Lavenberg et al. and Mitra and Mitrani. Section 6 discusses what we can learn from the model. Finally, in Section 7 we provide some concluding remarks and notes on future research directions.

2  A Model for Two Time Warp Processors

Assume we have a job which is partitioned into two processes, each of which is executed on a separate processor. As these processes are executed, we consider that they advance along the integers on the x-axis in discrete steps, each beginning at $x = 0$ at time $t = 0$. Each process independently makes jumps forward on the axis, where the size of the jump is geometrically distributed with mean $1/\beta_i$ ($i = 1, 2$). The amount of real time between jumps is a geometrically distributed number of time slots with parameter $\alpha_i$ ($i = 1, 2$). After process $i$ makes an advance along the axis, it will send a message to the other process with probability $q_i$ ($i = 1, 2$). Upon receiving a message from the other (sending) process, this (receiving) process will do the following:

Case  1:  If  its  position  along  the  x-axis  is  equal  to 
or  behind  the  sending  process,  it  will  ignore  the 
message. 

Case 2: If it is ahead of the sending process, it will
immediately move back (i.e., "rollback") along
the x-axis to the current position of the sending
process.

Let $F(t)$ be the position of the First process (process one) at time $t$ and let $S(t)$ be the position of the Second process (process two) at time $t$. Further, let

$$D(t) = F(t) - S(t).$$

Figure 1: States of two processors at times $t_1$ and $t_2$.

$D(t) = 0$ whenever Case 2 occurs (i.e., a rollback). We are interested in studying the Markov process $D(t)$. From our assumptions that $F(0) = S(0) = 0$, we have $D(0) = 0$. Clearly, $D(t)$ can take on any integer value (i.e., it certainly can go negative; see Figure 1). We will solve for

$$p_k = \lim_{t \to \infty} P[D(t) = k], \quad k = 0, 1, 2, \ldots$$

$$n_k = \lim_{t \to \infty} P[D(t) = -k], \quad k = 1, 2, 3, \ldots$$

namely, the equilibrium probabilities for the Markov chain $D(t)$. Moreover, we will find the speedup with which the computation proceeds when using two processors relative to the use of a single processor.

This is a simple model of the Time Warp distributed simulation algorithm where two processors are both working on a simulation job in an effort to speed it up. They both proceed independently until such time as one (behind) process transmits a message in the "past" of the other (ahead) process. This causes the faster process to "rollback" to the point where the slower process is located, after which they advance independently again until the next rollback, etc. The interpretation of the model is that the position of each process on the axis is the value of the local clock (or virtual time of the message being processed) of each process. The amount of real time to execute a particular event is modelled by the geometric distribution of time slots between jumps. The jumps in virtual time indicate the increase in the virtual timestamp from one event to the next. Messages passed between processors (with probability $q_i$) have virtual time stamps equal to the virtual time of the sending process. Our model assumes that states are stored after every event; otherwise a rollback would not necessarily send the processor back to the time of the tardy message; rather it might have to roll back to a much earlier time, namely, that of the last saved state. Another implicit assumption is that each process always schedules events for itself. Finally, the interaction between the processes is probabilistic.
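
The model can be explored with a short Monte Carlo simulation. This is a hedged sketch of the process described above and not the authors' code: it assumes a jump of size j occurs with probability beta(1 - beta)^(j-1) for j = 1, 2, ..., that a process sends only in a slot in which it advances, and that when both processes advance in the same slot the messages are handled after both jumps. The function and parameter names are illustrative.

```python
import random


def geometric(beta, rng):
    """Jump size j >= 1 with P[j] = beta * (1 - beta)**(j - 1), mean 1/beta."""
    j = 1
    while rng.random() >= beta:
        j += 1
    return j


def simulate(alpha, beta, q, slots, seed=0):
    """Run the two-process model; return (F, S, number_of_rollbacks)."""
    rng = random.Random(seed)
    pos = [0, 0]                 # virtual-time positions F(t) and S(t)
    rollbacks = 0
    for _ in range(slots):
        advanced = [rng.random() < alpha[i] for i in (0, 1)]
        for i in (0, 1):
            if advanced[i]:
                pos[i] += geometric(beta[i], rng)
        # A process that advanced may send a message stamped with its new position.
        senders = [i for i in (0, 1) if advanced[i] and rng.random() < q[i]]
        for i in senders:
            other = 1 - i
            if pos[other] > pos[i]:      # Case 2: the receiver is ahead and rolls back
                pos[other] = pos[i]
                rollbacks += 1
    return pos[0], pos[1], rollbacks
```

Dividing the average of the two final positions by the number of slots gives a rough estimate of the rate of virtual-time progress, which can be compared against analytical speedup results for a chosen parameter set.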

3  Discrete Time, Discrete State Analysis

In  this  section  we  provide  the  exact  solution  for  the 
discrete  time,  discrete  state  model  introduced  in  Sec¬ 
tion  2.  Although,  as  we  proceed,  the  equations  may 
look  formidable,  the  analysis  is  quite  straightforward. 
First,  we  provide  some  definitions. 

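$\alpha_i$ = P[$i$th processor advances in a time slot]
$\bar{\alpha}_i = 1 - \alpha_i$

$A_1 = \alpha_1 \bar{\alpha}_2$  (Only proc. 1 advances)

$A_2 = \alpha_2 \bar{\alpha}_1$  (Only proc. 2 advances)

$A_3 = \alpha_1 \alpha_2$  (Both advance)

$A_4 = \bar{\alpha}_1 \bar{\alpha}_2$  (Neither advances)

$g_j$ = P[processor 1 advances $j$ units] $= \beta_1 \bar{\beta}_1^{\,j-1}$, $j > 0$  ($\bar{\beta}_i = 1 - \beta_i$)

$f_j$ = P[processor 2 advances $j$ units] $= \beta_2 \bar{\beta}_2^{\,j-1}$, $j > 0$

$\gamma$ = P[procs. 1 and 2 advance the same dist.] $= \dfrac{\beta_1 \beta_2}{1 - \bar{\beta}_1 \bar{\beta}_2}$

$q_i$ = P[$i$th proc. sends a message]
$\bar{q}_i = 1 - q_i$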
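
The closed form for the probability that both processors jump the same distance follows from summing a geometric series. The short check below is illustrative and assumes jump sizes distributed as beta(1 - beta)^(j-1) for j = 1, 2, ...; the function name is hypothetical.

```python
def same_jump_probability(beta1, beta2, terms=10000):
    """Compare the direct sum of P[both jump j] with the closed form."""
    direct = sum(
        beta1 * (1 - beta1) ** (j - 1) * beta2 * (1 - beta2) ** (j - 1)
        for j in range(1, terms + 1)
    )
    closed = beta1 * beta2 / (1 - (1 - beta1) * (1 - beta2))
    return direct, closed
```

When both parameters equal 1 every jump is a single step, so the two processors always advance the same distance and the probability is 1.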

Since the transitions in our system are quite complex (there are an infinite number of transitions into and out of each state), we choose to show the state diagram only for a simplified version of our system, where $\beta_1 = \beta_2 = 1$, in Figure 2. This is the case where the processors only make jumps of a single step (from $k$ to $k + 1$).

Figure 2: State Diagram for $\beta_1 = \beta_2 = 1$.

The balance equations for our completely general system (no restrictions on $\beta_i$) are:

$$\bigl(A_1 + A_2 + A_3((1 - \gamma) + q_2 \gamma)\bigr)\, p_k = \cdots$$

$$p_0 = 1 - \sum_{k=1}^{\infty} p_k - \sum_{k=1}^{\infty} n_k$$

By using the technique of z-transforms we are able to solve explicitly for $p_k$ and $n_k$. We only give the results here; the full analysis will be found in a forthcoming paper [4].

We obtain

$$p_k = C_p\, p_0 \left(\frac{1}{r_1}\right)^{k}, \quad k \ge 1 \qquad (1)$$

$$n_k = C_n\, p_0 \left(\frac{1}{s_1}\right)^{k}, \quad k \ge 1 \qquad (2)$$


Po  = 


1  + 


Where 


Cp  = 


A7  (  53  '^ifk-i  +J2Pifk+i  ) 

C„ 

\i=0  i=l  / 

00 

+  Aiqi 

Kp 

i=*+l 

00  00 

+  ^39i  53 

i=*+l  ;=l 

Kn 

_ 0iD„{l  +  K„)_ _ 

(1  -  Ffp/£-„)(l  - /?,r2)(Ai -b /?i(>l2  +  >13)) 

_ 03Dn(l  +  Kp)_ _ 

(1  - /fpA:„)(l  - /?2S2)(/l2  + /?2(>ll  +  ^3)) 

_  _ 0102^P _ 

(1  —  0ir2){Al  +  0i(A2  +  >l3))(»'l  -  02) 

_  _ 020  l^n _ 

(1  —  ^2*2)(-^2  +  ^2('^1  +  •^3))(Sl  ~  0l) 

Dp  =  (A30i02q2'¥  A\(\.  —  0y0^) 

Dfi  =  (^3?2^t9l  +  ■'^2(1  ~  ^1^2))

A similar (symmetric) value, $\theta_2$, can be found for processor two. The quantity of most interest, though, is speedup, and we calculate its value in the next section.


Op  =  Ai  +  +  Az) 

bp  =  -  {{Ai  +  i42  +  A3)(1  +  ^1^2)  +  AiPi^j 

+/?2?2("^J^l  ~  AzPi)) 

Cp  =  /?2(-^l  +  -^2  +  -As)  +  AzPilz 


(«1.«2)  = 


On  ±  Wbn  -AOnCn 


4.1  Speedup 

Using the formulas for $p_k$ and $n_k$ we can calculate the speedup $S$ when using two processors versus using only one. $S$ is simply the rate of virtual time progress per real time step when using two processors ($R_2$) divided by the rate of progress when using only one processor ($R_1$). The rate of forward progress for one processor is defined simply as the average rate of progress of the two processes


=  A2  +  -^3) 

bn  =  —  ((Ai  -1- A2 -t- A3)(1 /Ji^2)  +  •^2/J2^1 
+A?i(Ai/32  “  AzPz)) 

Cn  =  -H  A2 -h  A3)  + 

4  Performance  Measures 

With the complete solution to the Markov chain in hand, we calculate several interesting quantities. The first is $K_i$, which is defined as the average distance that processor $i$ is ahead of the other. This measure is useful in getting a fix on the number of states which will need to be saved on average.

jr.  =  =  ^ 

'■  ■ 

Since the average size of a state jump at processor $i$ is $1/\beta_i$, the average number of state buffers needed at processor $i$ is $K_i \beta_i$.

Another useful measure, $\theta_1$, is the probability that processor one is ahead of processor two by more than $b$ units. This measure is exactly the probability that a fixed-size state buffer of size $b$ at processor one overflows if $\beta_1 = \beta_2 = 1$ (if only single steps forward are allowed).

$$\theta_1 = \sum_{k=b+1}^{\infty} p_k$$

k  00 

t=l  k=l 


$$R_1 = \frac{1}{2}\left(\frac{\alpha_1}{\beta_1} + \frac{\alpha_2}{\beta_2}\right)$$

The  rate  of  forward  progress  for  two  processors  is  the 
expected  “unfettered”  progress  (without  rollbacks) 
per  time  step  minus  the  expected  rollback  distance 
per  time  step  for  the  two  processors. 

00  *— 1 

-A372P0  53  9i  53  />  (»  -  i) 
i=3  ;=1 

00  t'-t 

-A3gipo53/*!Eff>(‘-» 

•*3  >stl 
00  *-l 

~A2q3^PkY^ifk-i 

k=l  i=l 

00  t— 1 

-Ai9i  53"*^*»*-» 

k=l  1=1 

00  00  t+i— 1 

-A392  53p*Z]^< 

k=l  i=l  j=l 

00  00  1 

-A391 5Z  /•  X]  jsk+i-j 

k=l  i=l  ;=1 

00  00  00 

-A39i  5Zp*  ^jh+i+i 

fc=l  i=l  i=l 

00  00  00 

-  A392  53  ”*  1C  ^Sk+i+j 

k=l  1=1  i=I 


As with the $p_k$ calculation, we omit the derivation of the following result. Combining all the terms together we find the formula for speedup.

S  =


■^1  ,  -^2  ,4 


A30i07q2Po  _ 

A(1-^1^2)  m-A'Pi) 

A2q20\CpPori _ A\q\02CnPQ^i 

(ri  -  ^i)(ri  -  1)2  (si  -  ^2)(ai  -  1)2 

A3q202CpPO  f  (*^1  -  0l)  ,  0102^1  ^ 

{l-^i;92)(ri-l)2  V  01  {n-02)) 

A3qi0lCnpO  /  (ai  —  ^2)  ,  0201^1  ^ 

(l-;9i;52)(«l-l)2  V  02  (Sl-0i)J 

_  A3P0  /  q\0\02Cp  q2020lCn  ^1 

1-0X02  \02{rx-02)  0x{sx-0x)}\ 


5  Limiting  Behavior 

The reason we chose to use a discrete time, discrete state (DD) model was to allow ourselves to take limits on the $\alpha$ and $\beta$ parameters, thus creating models which are continuous in time and state (whereby geometric distributions become exponential distributions). We omit the actual formulae here due to space considerations; see [4] for the full details.

We can transform our model into a continuous time, discrete state (CD) model by taking the limit as $\alpha_1$ and $\alpha_2 \to 0$ while keeping the ratio of the two constant and defining $a = \alpha_1/(\alpha_1 + \alpha_2)$. We can take the limit either on the $p_k$ equations or on our formula for speedup.

Alternatively, we create a discrete time, continuous state (DC) model by taking the limit as $\beta_1$ and $\beta_2 \to 0$ while keeping $\beta_1/\beta_2 = b$. We find the value for speedup by taking limits on the speedup formula calculated for the discrete time, discrete state model.

Finally, we can solve a continuous time, continuous state (CC) model by taking limits on $\alpha_i$ and $\beta_i$ simultaneously. This can be done by going first to either the CD ($\alpha_i$) or the DC ($\beta_i$) model from DD, and then finishing by taking limits on the other variable.


5.1  Previous Work on 2-Processor Models

There has been some similar work on two-processor Time Warp models. Lavenberg, Muntz and Samadi [5] used a continuous time, continuous state model to solve for the speedup ($S$) of two processors over one processor. Their work resulted in an approximation for $S$ which was valid only for $0 < q_i < 0.05$. Remember that $q_i$ is the interaction parameter: the probability that processor $i$ will send a message to the other processor. Their result is only valid for very weakly interacting processes. Our result for this CC case has no restrictions on any of the parameters and therefore subsumes their work. In fact, we can compare our results directly for a simplified case where $a = 1/2$ (same processing rate for both processors), $b = 1$ (same average jump in virtual time for both) and $q_1 = q_2 = q$ (same probability of sending a message), which is the completely symmetric case. Lavenberg et al. derive the following approximation for speedup:

$$S_L \approx 2 - \sqrt{2q}.$$

Our equation for speedup in this restricted case is:

$$S = \frac{4\left[\sqrt{q}\,(5 + q) + (1 + q)\sqrt{8 + q}\right]}{\sqrt{q}\,(2 + q)(7 + q) + \sqrt{8 + q}\,(2 + 5q + q^2)}$$

If we expand this formula using a power series about the point $q = 0$ and list only the first few terms, we see the essential difference between our result and that of Lavenberg et al.:

$$S \approx 2 - \sqrt{2q} + \frac{q}{2} + O(q^{3/2})$$

This clearly shows that our result matches Lavenberg et al. in the first two terms. We see that their result is only accurate for very small values of $q$, as they mention in their paper. Figure 3 shows the Lavenberg et al. result and our result compared to simulation with 99% confidence intervals.
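
The size of the gap between the two results can be examined numerically. The sketch below is illustrative: the closed form coded in `speedup_exact` is a reading of the symmetric-case formula from the damaged source, with its coefficients chosen so that it reproduces the leading terms 2 - sqrt(2q), so it should be treated as an assumption. The function names are hypothetical.

```python
import math


def speedup_exact(q):
    """Symmetric CC-case speedup (closed form as read from the source)."""
    r = math.sqrt(q)
    s = math.sqrt(8.0 + q)
    num = 4.0 * (r * (5.0 + q) + (1.0 + q) * s)
    den = r * (2.0 + q) * (7.0 + q) + s * (2.0 + 5.0 * q + q * q)
    return num / den


def speedup_lavenberg(q):
    """Two-term approximation of Lavenberg et al.: 2 - sqrt(2q)."""
    return 2.0 - math.sqrt(2.0 * q)


def speedup_series(q):
    """Three-term expansion about q = 0: 2 - sqrt(2q) + q/2."""
    return 2.0 - math.sqrt(2.0 * q) + q / 2.0
```

Evaluating the three functions on a grid of q values between 0 and 1 shows the two-term approximation drifting away from the closed form as q grows, while all three agree at q = 0, where the speedup is 2.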


Figure 3: Comparison of speedup results for a simplified case.

Mitra and Mitrani [9] also solve a two-processor model but use a discrete time, continuous state approach to solve for the distribution of the separation between the two processors and the rate of progress of the two processors. In the definition of their model, a processor sends a message (with probability $q_i$) before advancing. Our model has a processor send a message after advancing. This difference between the two models disappears in the calculation of the average rate of progress. Their solution allows a general continuous distribution for the state jumps (virtual time), but requires (deterministic) single steps for the discrete time. In our model this is equivalent to setting $\alpha_1 = \alpha_2 = 1$. Since our analysis only supports an exponential distribution for state changes, but their analysis doesn't have a distribution on time, neither model subsumes the other.

Finally, the DD and CD models do not seem to have appeared in
the literature, although a simplified version of the CD model
where $\beta_1 = \beta_2 = 1$ (a preliminary version of this
work which only allowed single-step state jumps) has been
published by Kleinrock [3]. Figure 4 shows how all of this work
fits together. The work discussed in this paper covers the
shaded region.


Figure 4: Previous work.

6 Results for a Restricted
Model

In order to better understand our results, we examine a
restricted version of the CD model (i.e. the model analyzed in
[3]). In this less general model we eliminate two variables by
forcing the processors to advance exactly one step each time
they advance ($\beta_1 = \beta_2 = 1$). Again, we define $q_i$
as the interaction parameter: the probability that processor
$i$ sends a message to the other processor. We also define $a$
as the ratio $\lambda_1 / (\lambda_1 + \lambda_2)$, where
$\lambda_i$ is the rate of the continuous time distribution for
processor $i$ (the rate at which messages are processed).
Figure 5 shows the speedup for the Symmetric case where
$q_1 = q_2 = q$. Figure 6 shows the speedup for the Balanced
case where $\lambda_1 = \lambda_2$. Figure 7 shows the speedup
for the extremely simplified Symmetric, Balanced case where
$q_1 = q_2 = q$ and $\lambda_1 = \lambda_2 = \lambda$. For this
special case the formula for speedup is

$$S = \frac{4}{2 + \sqrt{q}}.$$

Figure 5: Speedup for the Symmetric case $q_1 = q_2 = q$.

Figure 6: Speedup for the Balanced case
$\lambda_1 = \lambda_2 = \lambda$.

Figure 7: Speedup for the Symmetric, Balanced case
$q_1 = q_2 = q$ and $\lambda_1 = \lambda_2 = \lambda$.

6.1  Optimality  Proofs 

Using the simple model described above, we can prove several
results about optimality with respect to the parameters of the
system. We first show that the speedup is monotonically
decreasing in both $q_1$ and $q_2$, the interaction parameters.
We do this by showing that $\partial S / \partial q_i$ is
negative. First, a definition (with $\bar{a} = 1 - a$ and
$\bar{q}_1 = 1 - q_1$):

$$h(q_1) = 1 - 4 a \bar{a} \bar{q}_1.$$

If we differentiate $S$ with respect to $q_1$ we arrive at the
following formula

$$\frac{\partial S}{\partial q_1} = \Phi \cdot \left( -(-1 + 2a)^2 - 2 a \bar{a} q_1 + (1 - 2a)\sqrt{h(q_1)} \right),$$

where $\Phi$ is a non-negative function of $q_1$, $q_2$ and $a$.

In order to show that $\partial S / \partial q_1$ is negative,
we must show that

$$-(-1 + 2a)^2 - 2 a \bar{a} q_1 + (1 - 2a)\sqrt{h(q_1)} < 0. \qquad (4)$$

When $a \ge 1/2$ Equation 4 is trivially satisfied, since no
term on the left is positive. Our concern is in the range
$0 < a < 1/2$, in which case our condition becomes:

$$-(-1 + 2a)^2 - 2 a \bar{a} q_1 + (1 - 2a)\sqrt{h(q_1)} < 0$$

$$-(-1 + 2a)^2 - 2 a \bar{a} q_1 < -(1 - 2a)\sqrt{h(q_1)} < 0$$

$$\left( -(-1 + 2a)^2 - 2 a \bar{a} q_1 \right)^2 > \left( -(1 - 2a)\sqrt{h(q_1)} \right)^2$$

Expanding and simplifying, our condition reduces to

$$4 a^2 \bar{a}^2 q_1^2 > 0.$$

Since $a$, $\bar{a}$ and $q_1$ are all non-negative, the
inequality holds. A similar (symmetric) proof for $q_2$ is
omitted here.
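The algebra behind Equation 4 can be spot-checked numerically. The sketch below is illustrative only: it assumes $h(q_1) = 1 - 4a(1-a)(1-q_1)$ (reading the barred symbols as complements) and that the bracketed expression in Equation 4 is $-(-1+2a)^2 - 2a(1-a)q_1 + (1-2a)\sqrt{h(q_1)}$. The function names are hypothetical.

```python
import math


def h(a: float, q1: float) -> float:
    """Assumed definition: h(q1) = 1 - 4*a*(1 - a)*(1 - q1)."""
    return 1.0 - 4.0 * a * (1.0 - a) * (1.0 - q1)


def equation4_lhs(a: float, q1: float) -> float:
    """Left-hand side of Equation 4; the claim is that it is negative."""
    a_bar = 1.0 - a
    return (-(-1.0 + 2.0 * a) ** 2
            - 2.0 * a * a_bar * q1
            + (1.0 - 2.0 * a) * math.sqrt(h(a, q1)))


def squared_gap(a: float, q1: float) -> float:
    """Difference of the two squared sides; should equal 4*a^2*(1-a)^2*q1^2."""
    a_bar = 1.0 - a
    left = (-(-1.0 + 2.0 * a) ** 2 - 2.0 * a * a_bar * q1) ** 2
    right = (1.0 - 2.0 * a) ** 2 * h(a, q1)
    return left - right


def check_grid(steps: int = 19) -> bool:
    """Check the sign claim and the simplification on a grid with q1 > 0."""
    for i in range(1, steps + 1):
        a = i / (steps + 1)
        for j in range(1, steps + 1):
            q1 = j / steps
            expected = 4.0 * a ** 2 * (1.0 - a) ** 2 * q1 ** 2
            if equation4_lhs(a, q1) >= 0.0:
                return False
            if abs(squared_gap(a, q1) - expected) > 1e-12:
                return False
    return True
```

A grid check is not a proof, but it confirms that the squared inequality collapses to $4a^2\bar{a}^2q_1^2 > 0$ under the stated reading of $h$.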

Optimization with respect to $a$ is a little more difficult.
When we differentiate $S$ with respect to $a$ we get such a
complicated formula that it is prohibitive to solve for the
optimum value of $a$. Fortunately, by plotting $S$ versus $a$,
$q_1$ and $q_2$ we see that $S$ is unimodal and that the
optimum value of $a$ is 1/2 ($\lambda_1 = \lambda_2$). When we
plug this value ($a = 1/2$) into $\partial S / \partial a$ we
see that the result is 0.

$$\left.\frac{\partial S}{\partial a}\right|_{a = 1/2} = \frac{2\left( -\left((1 - q_1)\,\bar{q}_2\right) + \bar{q}_1 (1 - q_2) \right)}{(1 - q_1)(1 - q_2)} = 0$$

To show that this is a maximum we must show that the second
derivative is negative at $a = 1/2$. This is not difficult,
though we omit the equations here for brevity. For the more
general case, where the processors are not restricted to single
step advances, the result that $a = 1/2$
($\lambda_1 = \lambda_2$) for optimal performance generalizes
to

$$\frac{\lambda_1}{\beta_1} = \frac{\lambda_2}{\beta_2}$$


or 


(5) 


meaning that the average "unfettered" rate of progress in
virtual time for each processor should be the same. For a fixed
value of $a$ the best performance can be found when Equation 5
is true, and overall best performance is found at $a = 1/2$
with Equation 5 holding true. Speedup is not constant for
$a/b$ constant.


6.2  Adding  a  Cost  for  State  Saving 

One simple way of examining how state saving overhead affects
the performance of the system is to modify the value of $R_1$,
the rate of progress on a single processor. For example, if we
examine the CD model with the single step restriction (as
above) we arrive at the following value for $R_1$.

$$R_1 = \frac{c(\lambda_1 + \lambda_2)}{2}$$

The parameter $c$ ($c \ge 1$) indicates how much faster events
are executed without state saving. If $c = 2$, state saving
doubles the amount of time it takes to process an event. For
the CD model we find that the new formula for speedup is simply
$1/c$ times the old value. Let us examine a very simple case in
detail. If we look at the Symmetric, Balanced case, the updated
formula for speedup is

$$S = \frac{4}{c\,(2 + \sqrt{q})}.$$

It is easy to see that as $c \to \infty$ speedup will go to
zero. We are most concerned with the boundary where $S = 1$,
which is the transition from areas where TW on two processors
helps to where it hurts. Setting $S = 1$ and solving for $q$ we
find the necessary condition for two processors running TW to
be faster than only one:

$$q < \left( \frac{4}{c} - 2 \right)^2.$$

For $c > 2$ TW with two processors is always worse than running
on one processor without TW. Conversely, for $c < 4/3$ TW wins
out. The interesting range is where $4/3 < c < 2$. In this
range, certain values of $q$ will yield speedup, while others
will not. Figure 8 shows the regions in the $c$-$q$ plane where
TW on two processors is effective and where it is not. Thus, if
we know the values of both $c$ and $q$ for our Symmetric,
Balanced system we can immediately tell whether we can speed up
the application by running it under Time Warp on two
processors.
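The decision described here is easy to automate. The sketch below is a minimal illustration, assuming the Symmetric, Balanced speedup with state saving is $S = 4 / (c(2 + \sqrt{q}))$; the break-even value $q^* = (4/c - 2)^2$ follows from setting $S = 1$. The function names are hypothetical.

```python
import math


def speedup_with_state_saving(q: float, c: float) -> float:
    """Assumed Symmetric, Balanced speedup with state saving cost factor c."""
    return 4.0 / (c * (2.0 + math.sqrt(q)))


def break_even_q(c: float) -> float:
    """Value of q at which S = 1.

    Returns 0.0 when c >= 2 (no q gives speedup). A result of 1.0 or more
    means every interaction probability q in [0, 1) gives speedup.
    """
    return max(0.0, 4.0 / c - 2.0) ** 2


def time_warp_helps(q: float, c: float) -> bool:
    """True when two processors under Time Warp beat one processor."""
    return speedup_with_state_saving(q, c) > 1.0


if __name__ == "__main__":
    # Sweep the cost factor across the three regimes discussed in the text.
    for c in (1.2, 1.6, 2.5):
        print(c, break_even_q(c), time_warp_helps(0.2, c))
```

For example, with $c = 1.6$ the break-even point is $q^* = 0.25$, so a system with $q = 0.2$ benefits from Time Warp while one with $q = 0.3$ does not.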


Figure 8: Cost for state saving and its effect on performance.

7  Future  Work  and 
Conclusions 

There are several avenues to follow for future work. One is to
add message queueing to our model. Currently any message that
arrives in the future is ignored. This is unrealistic since the
messages in TW actually carry some work. Another addition would
be to charge some cost for rollback. In the present model,
rollbacks are free and therefore there is no penalty for
speculative computing. We have exact solutions for models that
address these concerns and they will appear in a future work
[1]. We would also like to extend the model to accommodate more
than two processors. Certainly, an exact Markov chain analysis
will quickly become intractable. We are currently investigating
extensions of our present model to many processors without
using a complicated Markov chain approach.


In this paper we have presented a model for two-processor Time
Warp execution and provided the results of its exact solution.
The model is general enough to subsume the work of Lavenberg,
Muntz and Samadi [5] and to partially subsume the work of Mitra
and Mitrani [9]. Further, we examined a simplified version of
our model and showed that, for optimal performance, the
processors should send as few messages as possible and their
independent rates of progress in virtual time should be the
same. Finally, we addressed the cost of state saving and its
effect on performance. Large state saving costs or frequent
message interactions indicate that TW is ineffective in gaining
speedup. The detailed analysis of our model and logical
generalizations to it will appear in future works [4][1].
Abstract


We present an analytical model for evaluating the performance
of finite-buffered packet switching multistage interconnection
networks using blocking switches under any general traffic
pattern. Most of the previous research work has assumed
unbuffered, single buffer or infinite buffer cases, and all of
it assumed that every processing element had the same traffic
pattern (either a uniform traffic pattern or a specific hot
spot pattern). Consequently, those models cannot be applied
very generally. There is a need for an analytical model to
evaluate the performance under more general conditions.

We first present a description of a decomposition and
iteration model which we propose for a specific hot spot
pattern. This model is then extended to handle more general
traffic patterns using a transformation method. For an even
more general traffic condition where each processing element
can have its own traffic pattern, we propose a superposition
method to be used with the iteration model and the
transformation method. We can extend the model to account for
processing elements having different input rates by adding
weighting factors in the analytical model.

An approximation method is also proposed to refine the
analytical model to account for the memory characteristic of a
blocking switch which causes persistent blocking of packets
contending for the same output ports. The analytical model is
used to evaluate the uniform traffic pattern and a very general
traffic pattern "EPOS". Comparison with simulation indicates
that the analytical model is very accurate.
1 Introduction

Packet-switching Multistage Interconnection Networks (MIN)
have been proposed for applications in multiprocessor systems
for interconnecting a large number of processing elements (PE)
and memory modules (MM), and also for fast packet switching of
packets to their destination ports [21]. The performance
analysis of MIN's thus becomes an important issue.

A considerable amount of performance analysis has been
reported on clocked, packet-switched multistage interconnection
networks. Example networks are the Banyan [5] and Omega [13]
networks. Most of the previous research was limited to
unbuffered or infinite buffered cases with a uniform traffic
pattern or a particular hot spot traffic pattern. Dias and Jump
[4] and Jenq [7] analyzed the single buffer case under a
uniform traffic pattern. Kim and Garcia [8] analyzed a single
buffered network with nonuniform traffic patterns. Kruskal and
Snir [9], [10], [11] analyzed the performance of banyan
networks with infinite buffer sizes under uniform and a
particular hot spot traffic pattern. Szymanski and Shaikh [19]
and Yoon et al. [23] presented analytical models to analyze a
finite buffered MIN under a uniform traffic pattern only.
Willick and Eager [22] also analyzed the infinite buffered
case; their model is able to analyze any given general traffic
pattern except that they need a special purpose analysis for
each non-uniform pattern. Theimer et al. [20] proposed a
technique for modelling the persistent blocking behavior of a
single buffered network with a uniform traffic pattern.

In this paper, we present an analytical model for clocked,
packet-switched multistage interconnection networks built with
2x2 blocking switches. The model is based on a decomposition
and iteration analysis of the Markov chain representing the
output queue in each switch. The goal is to provide a simple
analytical model which can be applied to evaluate the
performance of multistage interconnection networks with
arbitrary switch sizes, arbitrary buffer sizes, arbitrary input
rates for each processing element and any general traffic
pattern.

Figure 1: The 3 stage Banyan network with buffers at output
ports of each switch

This paper is organized as follows. Section 2 describes the
model assumptions and the approach. The decomposition and
iteration method is introduced with a unique routing
probability matrix to represent the steady state traffic flow.
A uniform traffic pattern and a special hot spot pattern are
discussed using this matrix. A transformation method is
introduced in Section 3 to allow any general traffic pattern to
be mapped onto the routing matrix to represent the steady state
flow. A superposition method is also introduced to allow each
processing element to have its own traffic pattern. The
resulting memory referencing pattern is even more general; as
an example we analyze the Non-Uniform Traffic Spots (NUTS)
pattern [12]. A weighting factor is used with the mapping
process to handle the case when processing elements are allowed
to have different input rates. An approximation method is
introduced in Section 4 to account for the "memory"
characteristics inherent in the blocking scheme. Results are
verified through simulation in Section 5. Conclusions are given
in Section 6.

2 Analytical Model for Uniform
Traffic and Hot Spot Traffic
Patterns

2.1 Architecture Description and Assumptions

In this paper, the interconnection network we consider is a
clocked, packet-switched finite-buffered Banyan network made up
of 2x2 switches, each of which has buffers of finite size $K$
at its output ports (see Figure 1). There are $N$ processing
elements and $N$ memory modules interconnected by the $n$-stage
(i.e. $N = 2^n$) interconnection network. All the operations of
input and output take place at the end of each cycle. The
interconnection network accepts requests from the input nodes
(processing elements), then routes them to the output nodes
(memory modules). Responses to these requests are returned from
the output nodes through the interconnection network in the
reverse direction to the original requesting nodes. The
"forward" network and "backward" network are distinct, but are
identical in topology. It is sufficient to discuss the delay
and throughput performance of the forward network only.

Each packet generated at the processing elements carries an
address tag with a number of bits equal to the number of stages
of the interconnection network. The address tag is a binary
representation of the destination address. The packet is then
fed into the first stage of the network. The first stage switch
examines the first (i.e. most significant) bit of the address
tag; if it is a 0, the packet is routed to the queue at the
upper output port. If the first bit is a 1, the packet is
routed to the queue at the lower output port (see Figure 1).
The packet then waits in the queue until its turn to be served.
The routing process repeats in each stage to choose a proper
output port. A blocking switch is assumed in which, if a
head-of-queue packet cannot go to the next stage due to a full
buffer or a "contention failure" for a single available
position, it stays at the current queue and waits for the next
cycle to try again. The blocking phenomenon has an implied
memory characteristic in that a blocked packet will attempt to
reach the same output port again. This memory characteristic
makes the analytical modelling difficult. We shall discuss an
approximation method to be incorporated into the analytical
model to model this memory characteristic in later sections.

When two packets from different queues in the same stage
contend for the same output queue in the next stage, a
contention occurs. If there are two or more spaces available at
the output queue, the switch is assumed to be fast enough to
accept both packets in one cycle. If there is only one space
available, a packet is randomly chosen to fill up this space;
the other packet is then "blocked" and stays at the original
queue. However, if no space is available in the next stage
(i.e. the queue at the output port is full), then both packets
are blocked.

Packets are assumed to be of the same length (i.e. fixed size
packets). A packet is generated by each processing element
independently with probability $q$ in each cycle. All
processing elements are assumed to have this identical
Bernoulli input process. This assumption is later relaxed by
using weighting factors to allow each processing element to
have its own input rate $q_j$, $1 \le j \le N$. We assume that
there is no buffer space at the processing elements. After
being generated, a packet is discarded if it cannot be
delivered to the first stage of the interconnection network
either due to a full buffer or a contention failure. Discarded
packets are not re-submitted. A packet, once accepted by the
network, is never discarded inside the network. The input
process is independent of the discarding process. (An extension
of the current model to allow blocked packets to be stored in a
finite-sized queue or an infinite queue is underway.) An
important performance measure is the total time a packet spends
in the network. Time delay is meaningful only for those packets
accepted into the network. The probability of acceptance,
another performance measure, is the probability that a packet
is accepted into the network after it is generated. The
normalized throughput is simply the probability of acceptance
multiplied by the input rate. Current work also includes an
extension to the case of multiple packet generation.

Each processing element has a memory module referencing
pattern. A referencing pattern is the set of probabilities with
which a packet accesses the various memory modules. All
previous work assumes that processing elements have the same
referencing pattern. We shall allow PE's to have their own
traffic pattern in Section 3. The memory module is assumed to
be fast enough to accept 1 packet per cycle from switches at
the last stage. This fast memory module assumption implies that
there is no blocking at the last stage since a dedicated link
connects 1 memory module to the output queue (see Figure 1). A
slower memory module (e.g. 2 cycles to accept a packet) will
have a severe effect on the performance of the network.
Extension to slower memory modules is underway.

2.2  Routing  Model 

In the real world, the packets are routed according to their
destination address. However, in order to analyze the network
analytically, an abstract flow model that can be used in an
analytical model must be established that at least faithfully
reflects the steady state flow situation in the network. We
propose a routing matrix $R = [r_{ij}]$, where $r_{ij}$ is the
routing probability of the $j$th input port in stage $i$. A
packet entering a switch will be routed either to the upper
output queue with probability $r_{ij}$ or to the lower output
queue with probability $1 - r_{ij}$. To simulate a uniform
traffic pattern, we simply let all $r_{ij}$ be 0.5. With equal
probability of choosing output queues, no memory module is
preferred. A special hot spot pattern can be created by letting
all $r_{ij}$ be an identical value greater than 0.5. For
instance, by letting all $r_{ij}$ be 0.8 in a 10 stage network,
10.7% ($= 0.8^{10}$) of the total traffic will go to memory
module 0 in a 1024-node network, with 2.7% of the traffic going
to each of the second highest referenced memory modules (all
memory modules with a single 1-digit in their address tag) and
other fractions of traffic to the other memory modules. The
advantage of this routing model is that by changing the value
of $r_{ij}$ with proper mappings from real traffic patterns, we
can evaluate any general traffic pattern. We leave the general
$r_{ij}$ to be discussed in Section 3. Throughout this section,
all $r_{ij}$ are assumed to have the same value, $r_{ij} = r$.
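The hot spot figures quoted above follow directly from the address tag: with a common routing probability $r$, a destination whose tag contains $k$ ones is reached with probability $r^{n-k}(1-r)^k$. The sketch below illustrates this under the stated assumption that every switch uses the same $r$ and that a 0 bit selects the upper port; the function name is hypothetical.

```python
def destination_probability(address: int, stages: int, r: float) -> float:
    """Probability that a packet is routed to memory module `address`.

    Each stage examines one tag bit: a 0 bit (upper port) is taken with
    probability r and a 1 bit (lower port) with probability 1 - r.
    """
    if not 0 <= address < 2 ** stages:
        raise ValueError("address does not fit in the given number of stages")
    ones = bin(address).count("1")
    zeros = stages - ones
    return (r ** zeros) * ((1.0 - r) ** ones)


if __name__ == "__main__":
    stages, r = 10, 0.8
    # Module 0 is the hot spot; module 1 has a single 1-digit in its tag.
    print(destination_probability(0, stages, r))
    print(destination_probability(1, stages, r))
```

Setting $r = 0.5$ gives every module the same probability $2^{-n}$, which is the uniform pattern.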


2.3  Analysis 

The proposed approximate analytical approach employs a
decomposition and iteration strategy. The real interconnection
network is in fact a network of finite-buffered queues with
blocking. The dependency among queues, caused by the blocking
from stage to stage, makes the exact analysis intractable. We
shall use an approximation technique similar to that applied to
tandem queues with blocking in [2], [3] and [17]. The
approximation method is to decompose each queue in the tandem
configuration with assumed input rates and blocking conditions.
The exact Markov chain is then solved to find the corresponding
input rates and blocking conditions. Each decomposed queue is
analyzed, then the whole process is repeated until it
converges, if it is to have a steady state. The concept of
using this decomposition and iteration approximation method in
analyzing the finite-buffered Banyan network is very similar to
that of tandem queues except that instead of a single input
source and a single output queue for each queue in the tandem
configuration, the interconnection network has 2 input sources
and 2 output queues for each queue in the network (except the
last stage queue where only 1 output sink is present, namely,
the memory module). Therefore, when we solve for the equivalent
input rates and blocking conditions for a decomposed queue, we
consider the combined input from 2 input sources and the
combined probability of blocking from the 2 output queues. The
approach is as follows:

Let $Q_{ij}$ represent the $j$th queue in stage $i$ and
$P_{ij}(k)$ be the steady state probability that there are $k$
packets in the queue $Q_{ij}$. Let $Q_{i-1,j_1}$ and
$Q_{i-1,j_2}$ be the two input sources from stage $i-1$ that
feed $Q_{ij}$. Let $X[l]$ be the probability that there are $l$
packets destined to $Q_{ij}$ from its two input sources. In the
following, we solve for the equivalent input rates for a queue
$Q_{ij}$ which is located at output port 0:


Figure  2:  Markov  chain  of  a  queue  extracted  from 
the  network  where  the  state  variable  represents  the 
number  of  packets  in  that  queue 


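$$X[1] = 2r(1 - r)\,(1 - P_{i-1,j_1}(0))\,(1 - P_{i-1,j_2}(0)) + r\left[ (1 - P_{i-1,j_1}(0))\,P_{i-1,j_2}(0) + P_{i-1,j_1}(0)\,(1 - P_{i-1,j_2}(0)) \right]$$

$$X[2] = \left[ r(1 - P_{i-1,j_1}(0)) \right] \cdot \left[ r(1 - P_{i-1,j_2}(0)) \right]$$

$$X[0] = 1 - X[1] - X[2] \qquad (1)$$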

The first term in the $X[1]$ equation corresponds to the case
that both queues in stage $i-1$ are not empty, and one chooses
output port 0 with probability $r$ and the other chooses
another output port with probability $1 - r$. The second term
corresponds to the case that one queue in the previous stage is
empty, and the other is not. The non-empty one chooses the
output port 0 with probability $r$. The summation of
probabilities in both cases represents the probability that
only one input packet feeds the queue. The $X[2]$ equation
represents the case when both queues in the previous stage are
not empty and they both choose output port 0 with probability
$r$.
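The three cases described above translate directly into code. The following sketch is illustrative: `p0_a` and `p0_b` are hypothetical names for the empty-queue probabilities of the two feeding queues, and the second term of $X[1]$ is written from the verbal description (exactly one feeding queue is non-empty and it chooses output port 0).

```python
from typing import Tuple


def arrival_distribution(r: float, p0_a: float, p0_b: float) -> Tuple[float, float, float]:
    """Return (X[0], X[1], X[2]) for a queue at output port 0.

    r     -- probability that a head-of-queue packet chooses output port 0
    p0_a  -- probability that the first feeding queue is empty
    p0_b  -- probability that the second feeding queue is empty
    """
    busy_a = 1.0 - p0_a
    busy_b = 1.0 - p0_b
    # Both feeders non-empty, exactly one packet chooses port 0.
    both_busy_one_arrival = 2.0 * r * (1.0 - r) * busy_a * busy_b
    # Exactly one feeder non-empty, and its packet chooses port 0.
    one_busy_one_arrival = r * (busy_a * p0_b + p0_a * busy_b)
    x1 = both_busy_one_arrival + one_busy_one_arrival
    # Both feeders non-empty and both packets choose port 0.
    x2 = (r * busy_a) * (r * busy_b)
    x0 = 1.0 - x1 - x2
    return x0, x1, x2


if __name__ == "__main__":
    print(arrival_distribution(0.5, 0.2, 0.4))
```

A useful sanity check is that the mean number of arrivals, $X[1] + 2X[2]$, equals $r$ times the sum of the two non-empty probabilities.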

Regarding the equivalent blocking condition, let $B_{ij}$ be
the probability that a packet in the $j$th queue in stage $i$
is blocked at the end of the cycle. Let $C_{ij}$ be the
probability that the $j$th queue in stage $i$ is blocking a
packet in stage $i-1$. Let $Q_{i+1,j_1}$ and $Q_{i+1,j_2}$ be
the two output queues of $Q_{ij}$ and let $Q_{ij'}$ be the
other queue that feeds both $Q_{i+1,j_1}$ and $Q_{i+1,j_2}$.
Then the equivalent blocking condition for queue $Q_{ij}$ is as
follows:

$$B_{ij} = r \cdot C_{i+1,j_1} + (1 - r) \cdot C_{i+1,j_2}$$

$$C_{i+1,j_1} = P_{i+1,j_1}(K) + \frac{1}{2} \cdot (1 - P_{ij'}(0)) \cdot P_{i+1,j_1}(K - 1) \qquad (2)$$

The first term in the $B_{ij}$ equation represents the case
when the packet at the head of queue $Q_{ij}$ chooses
$Q_{i+1,j_1}$ with probability $r$ and is blocked by
$Q_{i+1,j_1}$. The second term represents the other case when
the packet chooses $Q_{i+1,j_2}$ and is blocked. There are two
situations in which a queue blocks a packet in the preceding
stage: firstly, when the queue is full, and secondly, when the
queue has only one more space and a contending packet from
$Q_{ij'}$ wins the arbitration.


Given a set of initial values for the variables of the network,
we "extract" queue $Q_{ij}$ from the network (with the
equivalent input rates and blocking conditions as exist in the
network) as an independent queue. The Markov chain for this
queue is then solved to get new values for the state
probabilities. A sample Markov chain for the queue $Q_{ij}$
with buffer size 4 is shown in Figure 2 where $B$ represents
the blocking probability $B_{ij}$. We repeat this process for
the other queues in the first stage, in the order
$Q_{1,1}, Q_{1,2}, \ldots, Q_{1,N}$. Using these new state
probabilities as the new input rates, we repeat the same
process for all queues in the second stage in the order
$Q_{2,1}, Q_{2,2}, \ldots, Q_{2,N}$. This process is repeated
for all stages. Now we have a new set of values for the network
variables which can be used to compute the new input rates and
blocking probabilities. This new set of values is used in the
next iteration to compute another set of new values, and so on.
The iteration process is repeated until the difference between
two consecutive
iterations  is  below  10~*. 

The performance measures that are of interest are the
probability of acceptance, the normalized throughput and the
average time delay. There are two ways to calculate the
probability of acceptance. If we sum the output rate over all
output ports and divide it by the total input rate, we get the
probability of acceptance:

$$PA_{out} = \frac{\sum_{j=1}^{N} (1 - P_{n,j}(0))}{N \times q} \qquad (3)$$


The total output rate over the input rate is the probability of
acceptance at the output port. From the input port, we solve
for the probability that a packet generated at the PE's is
discarded due to a full buffer or a contention failure at the
first stage. This discarding probability is $B_{0,j}$, which
can be solved for using equation (2). Hence,

$$PA_{in} = 1 - B_{0,j} \qquad (4)$$

Both values, although solved in different ways, should be equal
when the MIN reaches steady state. (This can be used to test
for the correctness of the model.) The normalized throughput is
found by multiplying the probability of acceptance by the input
rate. We apply Little's result to calculate the average time
delay of a packet. When the network reaches steady state, we
take the sum of the mean queue size for the whole network using
the steady state probabilities of queue size of each queue.
Given the throughput and the average number of customers in the
system, the average time delay can be solved for by applying
Little's result.
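The delay calculation can be sketched in a few lines. This is an illustration only: it assumes the throughput used in Little's result is the total accepted rate of the whole network, $N \times q \times PA$, and the function and parameter names are hypothetical.

```python
from typing import Sequence


def mean_queue_size(state_probabilities: Sequence[float]) -> float:
    """Mean number of packets in one queue, given P(k) for k = 0..K."""
    return sum(k * p for k, p in enumerate(state_probabilities))


def average_delay(queues: Sequence[Sequence[float]],
                  n_inputs: int,
                  input_rate: float,
                  acceptance_probability: float) -> float:
    """Average time (in cycles) an accepted packet spends in the network.

    Little's result: delay = (mean packets in network) / (accepted rate),
    where the accepted rate is taken over all inputs (an assumption here).
    """
    packets_in_network = sum(mean_queue_size(q) for q in queues)
    accepted_rate = n_inputs * input_rate * acceptance_probability
    if accepted_rate <= 0.0:
        raise ValueError("accepted rate must be positive")
    return packets_in_network / accepted_rate


if __name__ == "__main__":
    # Two single-buffer queues, each holding a packet half of the time.
    print(average_delay([[0.5, 0.5], [0.5, 0.5]], 2, 0.5, 1.0))
```

The state probabilities passed in would come from the converged iteration; any consistent set of per-queue distributions can be used to exercise the function.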


2.4  Results 


The analytical results for a finite-buffered multistage
interconnection network under a uniform traffic pattern and a
hot spot pattern were shown in [14]. For various traffic loads
and various network sizes (from a single stage to a 9-stage
Banyan), the improvement of the probability of acceptance by
adding buffers (from unbuffered to buffer size 8) was shown for
both the uniform traffic pattern and various hot spot traffic
patterns. Another experiment was shown for a 9-stage,
8-buffered Banyan where the average busy buffer size was
calculated in each stage. By varying the offered load and the
degree of hot spots ($0.5 \le r \le 1.0$), we can see how the
average busy buffer size grows according to different
situations. The average time delay for a 9-stage, 8-buffered
Banyan network was shown with various offered loads and various
degrees of hot spots. A tree build-up time, defined as the time
for the saturated tree [18] to build up, was discussed and a
method for calculating the upper bound was shown using the
analytical model. The analytical model can also be extended
[14] to analyze the performance of the combining network
suggested in the NYU Ultracomputer project [6] and [18]. The
effectiveness of a combining switch was analyzed against
various offered loads and queue sizes for a 9-stage combining
network. It was shown that the combining switch works only when
there is a "hot" memory cell inside the hot spot memory module.


3  Analytical  Model  for  General 
Traffic  Conditions 

In this section, we consider very general traffic conditions
where not only a general traffic pattern is allowed, but also
each processing element can have its own traffic pattern and
its own input rate. The basic decomposition and iteration model
remains the main modeling approach. The additional analysis
which is needed for general traffic conditions is reduced to
finding the proper representation of the steady state traffic
flows in terms of the routing probabilities.

3.1 Model Assumptions

We still consider a clocked, finite-buffered multistage
interconnection network as discussed in the previous section.
All assumptions made in Section 2 remain the same except:


Figure 3: A general memory referencing pattern
shown in terms of accessing probabilities $A_j$


• Any general memory referencing pattern is allowed.

• Each processing element can have its own referencing
pattern.

• Each processing element can have its own input
rate.

These three different assumptions represent different
levels of general traffic patterns. We shall discuss
the modeling approaches for these three different
assumptions in the next subsections.

3.2 Identical General Traffic Patterns
for the Processing Elements

A traffic pattern that can be in any form implies that
the $r_{ij}$'s in the routing matrix no longer have the same
value r as discussed in the previous section. The
approach to model this general traffic pattern is to find
a mapping scheme that transforms the given referencing
pattern into a set of $r_{ij}$'s which reflects the steady
state traffic flow in the network.

Let us take a 3 stage Banyan network as an example,
as shown in Figure 3. Since we assume that all
processing elements have the identical general traffic
pattern, we only discuss the transformation method for
one processing element. If there exists a steady state
referencing pattern, we can represent it in terms of
destination accessing probabilities $A_j$, the probability
that a new packet generated by a processing element
chooses memory module j as its destination. Consider
a packet generated by processing element 0 and observe
the path it takes as it travels through the network
to access the memory modules. A packet chooses
memory module 0 with probability

$$A_0 = r_{11} \cdot r_{21} \cdot r_{31}.$$

Similarly, a packet chooses memory module 1 with
probability

$$A_1 = r_{11} \cdot r_{21} \cdot (1 - r_{31}).$$

Using these two equations, we find $r_{31}$ in terms of
$A_0$ and $A_1$:

$$r_{31} = \frac{A_0}{A_0 + A_1}$$

The other routing probabilities can be found in a similar
way:
^3*  *  - 7- 
Since there is only one traffic pattern for all processing
elements,  this  routing  probability  set  is  valid  for  all 
other  processing  elements. 

This  transformation  method  is  simply  a  mapping 
of  the  memory  referencing  pattern  onto  a  set  of  rout¬ 
ing  probabilities.  We  incorporate  the  transformation 
method  into  our  analytical  model  approach  in  section 
2.  When  we  calculate  the  equivalent  input  rates  and 
blocking probabilities, we replace the value r in equations
(1)-(2) with the proper values. Thus, the decomposition
and iteration model we proposed in section
2 can be extended to analyze a multistage interconnection
network with any general traffic pattern.
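
The transformation method can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: it assumes the paths from one processing element form a complete binary tree whose leaves are the memory modules in index order, and that the "upper" output of a node leads to the lower-indexed half of its leaves. The function name `routing_probabilities` and the list-of-lists layout are hypothetical.

```python
def routing_probabilities(access_probs):
    """Map destination probabilities A_j onto per-node routing probabilities.

    Assumption (not from the paper): the memory modules are the leaves of a
    complete binary tree, in index order, and "upper output" at a node means
    the lower-indexed half of the leaves below that node.

    Returns one list per stage; entry k of stage s is the probability that a
    packet reaching the k-th tree node of that stage takes the upper output.
    """
    n = len(access_probs)
    stages = n.bit_length() - 1
    if n < 2 or 2 ** stages != n:
        raise ValueError("number of memory modules must be a power of two")

    result = []
    for stage in range(stages):
        block = n >> stage      # leaves reachable from each node of this stage
        half = block // 2
        row = []
        for start in range(0, n, block):
            upper = sum(access_probs[start:start + half])
            total = sum(access_probs[start:start + block])
            # A node that is never reached has no defined routing
            # probability; 0.5 is an arbitrary placeholder.
            row.append(upper / total if total > 0 else 0.5)
        result.append(row)
    return result


if __name__ == "__main__":
    # Hypothetical 8-module referencing pattern with a mild hot spot at module 0.
    A = [0.4, 0.1, 0.1, 0.1, 0.1, 0.1, 0.05, 0.05]
    for stage, row in enumerate(routing_probabilities(A), start=1):
        print("stage", stage, row)
```

For the last stage the first entry reduces to $A_0 / (A_0 + A_1)$, which is the expression derived in the text, and multiplying the three entries along the path to module 0 recovers $A_0$.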

3.3 Different General Traffic Patterns
for Processing Elements

For  more  generality,  we  allow  each  processing  element 
to  have  its  own  general  traffic  pattern.  In  practice,  it 
is  quite  possible  that  the  traffic  requirement  for  each 
processing  element  is  different  from  the  others.  Let  us 
assume  that  they  all  have  the  same  input  rate.  The 
modelling approach for this case is to find a proper set
of routing probabilities $r_{ij}$ which reflects the combined
traffic flow in steady state. A superposition method is
proposed, in addition to the use of the transformation
method, to find this proper set of routing probabilities.
We first apply the transformation method to all processing
elements to transform their general memory
referencing patterns into N routing matrices.
Since all processing elements have the same input rate,
we simply take the mean value of these N sets of routing
probabilities to find a routing matrix that reflects
the combined traffic flows in steady state. This routing
matrix is used in the decomposition and iteration
model. Thus, the analytical model can be extended to
analyze the case where each processing element has its
own traffic pattern.


3.4 Different Input Rate and Different
Traffic Patterns for the Processing
Elements

In  many  cases,  the  processing  elements  might  have  dif¬ 
ferent  packet  input  rates.  This  presents  the  most  gen¬ 
eral  traffic  condition.  The  modelling  approach,  again, 
is  reduced  to  finding  the  proper  representation  of  the 
routing  matrix. 

A  slower  source  contributes  less  to  the  steady  state 
flows.  Hence  we  must  determine  a  weighting  factor  for 
each  source  to  reflect  its  contribution  to  the  steady 
state  traffic.  The  ideal  weighting  factor  is  the  input 
rate  of  each  source.  When  we  take  the  mean  of  the 
N  routing  matrices,  we  then  weight  their  contribution 
according  to  their  input  rates.  If  we  incorporate  this 
weighting  factor,  the  model  can  be  extended  to  analyze 
the  case  where  each  processing  element  has  its  own 
traffic  pattern  and  its  own  input  rate. 
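
The superposition step of Sections 3.3 and 3.4 amounts to an element-wise mean of the N routing matrices, weighted by input rate when the rates differ. The sketch below is a hedged illustration: the function name `superpose` and the representation of each routing matrix as a flat list of probabilities are assumptions made for the example, not notation from the paper.

```python
def superpose(routing_sets, input_rates=None):
    """Combine per-source routing probabilities into one set.

    routing_sets[s][k] is the k-th routing probability obtained by applying
    the transformation method to source s (flat-list layout is an assumption).
    input_rates[s] is the packet input rate of source s; when omitted, all
    sources are weighted equally, which gives the plain mean of Section 3.3.
    """
    n = len(routing_sets)
    if n == 0:
        raise ValueError("at least one routing set is required")
    if input_rates is None:
        input_rates = [1.0] * n
    if len(input_rates) != n:
        raise ValueError("one input rate per source is required")
    total_rate = sum(input_rates)
    if total_rate <= 0:
        raise ValueError("total input rate must be positive")

    size = len(routing_sets[0])
    if any(len(rs) != size for rs in routing_sets):
        raise ValueError("all routing sets must have the same size")

    # Weighted element-wise mean: a slower source contributes less.
    return [
        sum(w * rs[k] for w, rs in zip(input_rates, routing_sets)) / total_rate
        for k in range(size)
    ]


if __name__ == "__main__":
    sets = [[0.2, 0.8], [0.6, 0.4]]          # two hypothetical sources
    print(superpose(sets))                   # equal input rates
    print(superpose(sets, [3.0, 1.0]))       # first source three times as busy
```

With equal rates the result is the simple mean; with unequal rates the busier source pulls the combined routing probabilities toward its own pattern.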

3.5  Results 

Since  the  proposed  analytical  model  employs  several 
approximate  methods,  it  is  important  to  study  how 
these  approximations  affect  the  model  accuracy.  There 
are  two  approximations  in  the  modelling  approach  : 

•  decomposing  a  queue  from  a  network  of  queues 
with  blocking  into  an  independent  queue. 

•  using  a  general  routing  matrix  to  model  the 
steady  state  flows. 

The  first  approximation  is  obvious  since  dependent 
queues  are  decomposed  into  equivalent  independent 
queues  and  solved  individually.  Some  accuracy  is  lost 
because our model neglects the dependency and coupling
among the queues. The second approximation
allows packets to choose their output ports every cycle
independently according to the routing probabilities,
instead of the real-world address tag. This renewal
routing choice allows a blocked packet to choose a different
output port in the next cycle. This renewal assumption
renders the analytical model optimistic since
it "allows" blocked packets to be routed around a congested
queue. In the real world, blocked packets repeatedly
access the same destination, and most likely,
these blocked packets will be blocked again (especially
when the traffic is not uniform).

The EFOS (Even-First-Odd-Second) pattern was
proposed in [12] using an Omega network where even
addressed processing elements send all their traffic to
the first half of memory modules uniformly while the
odd addressed ones send their traffic to the second half
of memory modules uniformly. The destination traffic
distribution looks uniform, but there are severe contentions
for the common paths inside the network. It
is an example of our general traffic pattern where there
are two traffic patterns for the processing elements. A
6 stage Omega network with 4 output buffers at each
switching element was evaluated under both a uniform
traffic pattern and an EFOS traffic pattern. The analytical
results we obtained are plotted against simulation
results in Figure 4. As predicted, the analytical
model is very optimistic due to the independent routing
choices it allows. When severe blocking is present
due to contention, the blocked packets will choose the
same output queues repeatedly in the real world while
the renewal choice in the analytical model allows the
blocked packets to choose other queues. This inherited
"memory" structure in blocking switches severely
degrades the performance since it is likely to have persistent
contention for a queue once contention occurs.
The discrepancy between analytical and simulation
data is caused mainly by this memory characteristic
of the blocking switch. We propose an improvement in
the next section to model this "memory" behavior of
a blocking switch.

Figure 4: Comparison of results for a 6 stage, 4-buffered
Omega network

4 Analytical Model for a Blocking
Switch with Persistent Blocking

4.1  Model  Approach 

Since the basic model is a renewal process, we continue
to model the memory behavior as a renewal process.
However, the behavior of a blocked packet, after its
first blocking, is such that the routing choice no longer
uses the renewal probability $r_{ij}$. Biasing the routing
probabilities to account for this does not help since it
changes the memory referencing pattern. The routing
probabilities were created to reflect the steady state
memory referencing pattern; therefore, it is necessary
to keep the values unchanged.

Figure 5: The states of a server during its busy period

Although an exact model of this persistent blocking
behavior would require that we keep track of how
many times a packet has been blocked at a given node,
we choose an approximation which captures the "first
order" effect of this persistence using the following two
state model. When the queue is not empty, we model
the server as being either in the "new" state or the
"blocked" state. When a packet first comes into the
server, the server is in the new state. The server enters
the blocked state when the packet is blocked, and it
remains in the blocked state until the blocked packet
finally goes through to the next stage. This cycle repeats
until the server empties the queue and becomes
idle. Observe that the server is inactive when it is in
the blocked state. While in the new state, the server
obeys the renewal behavior, choosing an output port
according to the routing probability $r_{ij}$. Hence, we can
approximate a blocking switch with "memory" characteristics
by a finite buffer queue with a reduced service
rate. The reduced portion is the probability that the
server is in the blocked state.

The diagram in Figure 5 shows how the server alternates
between the new state and the blocked state
during its busy period. Let b be the probability that
a new packet is blocked when it tries to go to the
next stage. Let c be the probability that a blocked
packet is blocked again when it tries to go to the same
destination. Then for our approximation, the steady
state probability that the server is in the blocked state,
$P_{blocked}$, can be solved in terms of b and c:

$$P_{blocked} = \frac{b}{1 - c + b}$$

where b is the blocking probability (for which we used
the notation $B_{ij}$ in section 2.3). Once blocked, it is
more likely that a blocked packet gets blocked again;
therefore, the value of c is selected to be larger than
the value of b. In fact, when a packet is in the blocked
state, the length of the destination queue in the next
cycle will be either K (full) or K-1 (only one space
available). If we disregard how many times it has been


blocked previously, there will be only two cases: either
the blocked packet faces a full queue or a queue
with one space left. In the first case, with probability
packet will be blocked again. In the
second case, with probability the packet
will face possible contention from the other queue in
the same stage which feeds this destination queue.
Incorporating these two probabilities in equation (2), c
can be found in a similar way:

c  =  r  •  C,^.i,ji  +  (1  -  r)  ■ 

The probability that the j1-th queue in stage i+1 is
blocking a packet in stage i, $C_{i+1,j1}$, is:


are  similar  to  the  ones  in  section  2.3  ; 

^(1)  «  2r(l  -  r)(l  - 

m  =  lr(l  -  P.f f^i(0))l .  (r(l  -  pri/,,2(0))l 

$$X[0] = 1 - X[1] - X[2]$$

$$P^{eff}_{i-1,j1}(0) = 1 - (1 - P_{i-1,j1,blocked}) \cdot (1 - P_{i-1,j1}(0))$$

$$P^{eff}_{i-1,j2}(0) = 1 - (1 - P_{i-1,j2,blocked}) \cdot (1 - P_{i-1,j2}(0))$$

The equivalent blocking probability can be
found as follows:


_ Pi+l,jl(f(  ~  I) _ 


Pij  —  1"  •  Ci^ijx  +  (1  —  r)  ■ 


$P_{blocked}$ is the probability that the server is in the
blocked state. During this period, the server is inactive.
Therefore, we may use this probability to approximate
the blocking switch with "memory" characteristic.
At the beginning of each cycle, the server tosses a
coin which comes up heads with probability $P_{blocked}$,
in which case the server will be blocked (inactive). If
there is a packet at the server, it stays idle until the
next cycle when the coin will be tossed again. With
probability $1 - P_{blocked}$, the server will be active. The
queue length then determines whether the server will
send a packet or not. If there are packets in the queue,
the server takes the first packet and routes it according
to the routing probability.

Incorporating the probability $P_{blocked}$ into our previous
model, the approach is then similar except that
the equivalent input rates and blocking probabilities
are different. In the original model, when a queue is
not empty (with probability $1 - P_{ij}(0)$), it tries to
transmit a packet to the destination in stage i+1. However,
for the persistent blocking model, a queue tries
to transmit a packet to the next stage with probability
$(1 - P_{ij}(0)) \cdot (1 - P_{blocked})$; the former is the probability
that the server is not empty and the latter is the probability
that the server is in the "active" state. When
the server is not empty and it is active, it transmits a
packet to the next stage.
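
The two-state approximation can be checked numerically. In the sketch below (an illustration, not the authors' code), the server moves from the new state to the blocked state with probability b and stays in the blocked state with probability c; balancing the flow between the two states gives a blocked-state probability of b / (1 - c + b). The function names are hypothetical, and `iterate_chain` is included only as a cross-check on the closed form.

```python
def blocked_state_probability(b, c):
    """Stationary probability of the blocked state in the two-state model.

    b: probability that a new packet is blocked (new -> blocked).
    c: probability that a blocked packet is blocked again (blocked -> blocked).
    """
    if not (0.0 <= b <= 1.0 and 0.0 <= c < 1.0):
        raise ValueError("need 0 <= b <= 1 and 0 <= c < 1")
    return b / (1.0 - c + b)


def transmit_probability(p_empty, p_blocked):
    """Probability that a queue tries to transmit in a cycle:
    the server is not empty and is in the active state."""
    return (1.0 - p_empty) * (1.0 - p_blocked)


def iterate_chain(b, c, steps=200):
    """Cross-check: step the two-state chain until it settles."""
    p_new, p_blk = 1.0, 0.0
    for _ in range(steps):
        p_new, p_blk = (p_new * (1.0 - b) + p_blk * (1.0 - c),
                        p_new * b + p_blk * c)
    return p_blk


if __name__ == "__main__":
    b, c = 0.2, 0.5          # hypothetical values with c > b
    p_blocked = blocked_state_probability(b, c)
    print(p_blocked, iterate_chain(b, c))
    print(transmit_probability(p_empty=0.25, p_blocked=p_blocked))
```

Choosing c larger than b, as the text argues, increases the blocked-state probability and therefore reduces the effective service rate of the queue.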

Let us define $P^{eff}_{ij}(0)$ to be the effective probability
that $Q_{ij}$ will not send a packet (either the server is
empty or the server is not empty and is blocked). Let
$P_{ij,blocked}$ be the probability that $Q_{ij}$ is not empty
and is in the blocked state. Then the effective input
rates of a queue $Q_{ij}$ in this persistent blocking model


C.+1JI  =  P.^,  (/f )-K ^  •(!  -  ( A  -  1 ) 

Incorporating  these  equivalent  input  rates  and 
blocking  probabilities  into  the  previous  model,  we  can 
evaluate  a  multistage  interconnection  network  with 
persistent  blocking  behavior.  We  first  calculate  the 
equivalent  input  rates  and  blocking  probabilities  for 
each  queue  in  the  first  stage.  Then  we  decompose  the 
first  queue  in  the  first  stage  and  solve  its  Markov  chain 
with  the  equivalent  input  rates  and  blocking  probabil¬ 
ities.  The  remaining  queues  in  the  first  stage  are  de¬ 
composed  and  their  Markov  chains  are  solved  one  by 
one.  The  steady  state  probabilities  for  these  queues 
are used to calculate the equivalent input rates to the
queues  in  the  second  stage.  The  queues  in  the  second 
stage  are  then  decomposed  and  solved.  This  process  is 
repeated  for  all  stages.  The  analytical  model  iterates 
this  decomposition  process  until  the  throughput  con¬ 
verges.  Then  we  calculate  the  time  delay  using  Little’s 
result.  Other  performance  measures  can  be  computed 
using  the  steady  state  parameters  of  the  system. 
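
The final step, obtaining the time delay from Little's result, can be illustrated as follows. This is a minimal sketch under assumptions that are not stated in the text: each queue's steady state distribution is given as a list whose k-th entry is the probability of holding k packets, and the throughput is the total number of packets delivered per cycle. Both function names are hypothetical.

```python
def mean_queue_length(distribution):
    """Mean number of packets in one queue.

    distribution[k] is the steady state probability that the queue holds
    k packets (k = 0 .. K); this layout is an assumption of the sketch.
    """
    return sum(k * p for k, p in enumerate(distribution))


def mean_delay(queue_distributions, throughput):
    """Little's result: mean delay = mean number in system / throughput.

    queue_distributions: one distribution per queue in the network.
    throughput: total packets delivered per cycle (must be positive).
    Returns the mean delay in cycles.
    """
    if throughput <= 0:
        raise ValueError("throughput must be positive")
    packets_in_network = sum(mean_queue_length(d) for d in queue_distributions)
    return packets_in_network / throughput


if __name__ == "__main__":
    # Two hypothetical queues with small buffers.
    dists = [[0.5, 0.5], [0.25, 0.5, 0.25]]
    print(mean_delay(dists, throughput=0.5))
```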

4.2  Results 

We  ran  our  model  incorporating  this  new  technique 
to  handle  the  memory  behavior  for  the  same  6-stage 
Omega  network  (as  in  Section  3.5)  with  buffer  size  4 
under both the uniform traffic and the EFOS traffic
pattern.  The  result  is  shown  in  comparison  with  the 
former  model  in  Figure  6.  The  improved  model  greatly 
reduces  the  discrepancy  between  the  simulation  and 
analytical  model  results. 

Figure 6: Comparison of results for a 6-stage, 4-buffered
Banyan network with and without the "memory"
behavior improvement

For a detailed study, we compare the improved analytical
results with simulation in Figures 7-14. The
confidence range of these simulations is 95%. The first
case, shown in Figures 7-8, is for a 4-buffered, 6 stage
Omega network with a uniform traffic pattern. In Figure
7, the throughput is compared with various offered
loads and in Figure 8, the average time delay is compared.
The offered load is varied from 0.1 pkt/cycle
to 1.0 pkt/cycle for each processing element. For offered
loads within the range of small loss probability
(q < 0.7 pkt/cycle), the simulations verify the accuracy
of the analytical results. Beyond this load (when
packets begin to be discarded), the analytical results
are slightly optimistic.

Figures 9-10 show the case of a 4-buffered, 6
stage Omega network with an EFOS traffic pattern.
Throughput and average time delay are plotted
against offered load. The non-uniformity of this pattern
severely degrades the performance. The throughput
graph shows very good correspondence between
analytical results and simulations. For offered load
within the low-loss range (q < 0.4 pkt/cycle), the delay
predicted by the analytical model is very accurate.
However, the analytical delay results are
still slightly optimistic in heavy load cases.

The third case is included to determine whether the
analytical model performs well with a larger buffer.
The results for an 8-buffered, 6 stage Omega network
with a uniform traffic pattern are shown in Figures
11-12. Except with heavy load (q = 0.9 and 1.0), the
analytical results measure well when compared to simulations.
The throughput and delay performance are
optimistic when the total input enters the range of
heavy load. This simulation indicates that the analytical
model performs well for networks with other buffer
sizes.

We show the analytical results of a large sized network
in Figures 13-14, namely a 4-buffered, 10 stage
(1024x1024) Omega network with a uniform traffic
pattern. Again, the analytical model is slightly optimistic
when the offered load exceeds the maximal attainable
output. This indicates that the model is suitable
for large sized networks as well.

Figure 7: Throughput comparison for a 4-buffered, 6
stage Omega MIN with a uniform traffic pattern

Figure 8: Mean delay comparison for a 4-buffered, 6
stage Omega MIN with a uniform traffic pattern

Figure 9: Throughput comparison for a 4-buffered, 6
stage Omega MIN with EFOS traffic pattern

Figure 10: Mean delay comparison for a 4-buffered, 6
stage Omega MIN with EFOS traffic pattern

Figure 11: Throughput comparison for an 8-buffered,
6 stage Omega MIN with a uniform traffic pattern

Figure 12: Mean delay comparison for an 8-buffered, 6
stage Omega MIN with a uniform traffic pattern

Figure 13: Throughput comparison for a 4-buffered, 10
stage Omega MIN with uniform traffic pattern

Figure 14: Mean delay comparison for a 4-buffered, 10
stage Omega MIN with uniform traffic pattern

4.3  Conclusion 

An  analytical  model  was  developed  to  evaluate  the 
performance of finite-buffered Multistage Interconnection
Networks with any general traffic pattern. The
Interconnection  Network  was  modelled  as  a  network 
of  finite-buffered  queues  with  blocking.  We  proposed 
a  decomposition  and  iteration  method  to  analyze  this 
network  of  queues  with  blocking.  A  transformation  and 
superposition  method  was  then  proposed  to  analyze 
these  networks  for  any  general  traffic  pattern.  How¬ 
ever,  the  renewal  routing  choice  assumption  caused  the 
modeling  results  to  be  optimistic.  Therefore,  we  pro¬ 
posed  approximate  techniques  to  model  the  persistent 
blocking  situation.  The  analytical  results  were  com¬ 
pared  to  the  simulation  in  different  buffer  sizes  and 
different  network  sizes  under  both  a  uniform  traffic 
and  a  general  traffic  pattern.  The  simulations  show 
that  the  analytical  results  are  very  accurate  when  the 
offered  load  does  not  exceed  the  maximal  attainable 
total  output.  For  cases  where  the  offered  load  is  beyond 
the  maximal  attainable  output,  the  analytical  results 
are  slightly  optimistic.  This  verifies  the  accuracy  and 
the  flexibility  of  the  analytical  model. 
The Latency/Bandwidth Tradeoff
in  Gigabit  Networks 

Gigabit  networks  really  are  different! 
The bandwidth for data communications
has been growing steadily and
dramatically over the last twenty years.
Some of us remember the early days
of data modems which provided
access speeds of 10 characters per second
(cps) in the late 1960s. When 300 baud speeds
became available (providing 30 cps), we thought
of it as a major improvement (and it was).

In the mid-70s, as packet-switched networks [1]
began to proliferate, we saw the standard set at
64 kb/s trunk speeds; of course, by the time one paid
for the software and protocol overhead, we were
happy to end up with about 10 kb/s file transfer speeds
(by now, the dial-up data modem speeds had reached
2400 bits per second). The killer application
which drove the penetration of these X.25 networks
was that of transaction processing.

In the 1980s we witnessed the proliferation of
T1 channel speeds, providing 1.544 megabit per second
(Mb/s) trunk speeds. Private T1 networks exploded
in the 1980s because of the cost savings they
provided by allowing corporations to integrate their
voice and data networks into a single network.
This was the killer application for corporate T1
networks. In the scientific community, T1 was introduced
toward the end of the 1980s due to the
killer applications of e-mail and file transfers; the
load from these applications arose very quickly
due to the enormous growth in the number of
connected users. However, the packet-switched networks
still had 64 kb/s backbone speeds due
largely to the complex operations the switches
were required to carry out; specifically, each
switch had to process every packet up to the third
layer (the network layer) of the seven-layer OSI
architecture [2].

As we entered the 1990s, we saw a grass roots
development in the form of Frame Relay networks
[3,4]. These nets offer packet switching at
T1 speeds, a significant step above the 64 kb/s
packet switching nets of the 1980s. Both hardware
and software developments led to these
higher speed packet switched networks. On the hardware
side, the widespread deployment of fiber
optic communication channels by the long-haul carriers
was critical. Besides having enormous bandwidths,
these fiber optic channels are extremely noise-free,
thereby greatly relieving the network of extensive
error control. Faster switches have also been
developed due to the progress in VLSI technology.
However, the communication bandwidth has
grown much more rapidly due to fiber optics than
has the speed of the switch due to VLSI. Indeed,
prior to the fiber optic revolution, the communication
link had represented the performance bottleneck,
and so one was prepared to waste switch
capacity in order to save communication capacity.
This took the form of packet switching in which intelligent
switches were introduced into our data
communication networks in order to dynamically
assign the channel bandwidth on a demand basis.
However, now that fiber optics has appeared, the
communication bandwidth is no longer a constraint;
in fact, a reversal in the relative cost of switching and
transmission has taken place and has led us to
architectures in which the switch has now become
the economic as well as the performance bottleneck.
Considerable research and development effort is
currently under way to produce high-speed packet
switches [5].

New protocols which take advantage of these
hardware improvements have also been developed.
In particular, the ISDN signalling channel (the
D channel) uses a streamlined protocol for routing
signalling packets (known as the Link Access Protocol
for the D channel - LAPD) [6]; indeed, it
only processes these packets up to the second
layer (the data link layer) of the seven layer
model, extracting a minimal amount of network layer
information. Frame Relay uses the LAPD protocol
for the data channel (rather than just for the
signalling channel), thereby achieving much higher
transfer speeds than were possible with X.25 packet
networks. Thus, by relegating as much function to
hardware as possible, by moving function out of
the network when possible (e.g., error control on the
data packets), and by taking advantage of streamlined
packet protocols, Frame Relay is able to achieve
packet switching at T1 speeds. The killer application
which has been the driving force behind Frame Relay
is that of local area network (LAN) interconnection.
■ Figure 1. Sending a 1-Mb file across the U.S.


In  addition,  we  have  seen  some  multi-megabit 
data network plans, announcements and offerings.
Among these are the Fiber Distributed
Data Interface (FDDI) at 100 Mb/s [7], Switched
Multimegabit Data Services (SMDS) at 45 Mb/s [8],
the Distributed Queue Dual Bus access protocol
at 45 and 150 Mb/s [9], ATM switches and Broadband
ISDN [10] at 155 Mb/s up to 2.4 gigabits
per  second  (Gb/s),  the  High  Performance  Paral¬ 
lel  Interface  (HIPPI)  at  800  Mb/s,  etc.  Indeed, 
the  Synchronous  Optical  Network  (SONET)  [11] 
standard  has  defined  speeds  for  optical  systems  well 
into  the  multigigabit  range. 

It  is  clear  we  are  moving  headlong  into  an  era 
of  gigabit  per  second  speeds  and  networks. 

The  Major  Issue:  Latency  vs. 
Bandwidth 

As we move into the gigabit world, we must ask
ourselves if gigabits represent just another
step in an evolutionary process of greater bandwidth
systems, or if gigabits are really different. In the
opinion  of  this  author,  gigabits  are  indeed  different, 
and  the  reason  for  this  difference  has  to  do  with 
the  effect  of  the  latency  due  to  the  speed  of  light. 

Let  us  begin  by  examining  data  communica¬ 
tion  systems  of  various  types.  It  turns  out  that 
there  are  a  few  key  parameters  of  interest  in  any  data 
network system. These are:

C  =  Capacity  of  the  network  (Mb/s) 
b  =  Number  of  bits  in  a  data  packet 
L  =  Length  of  the  network  (miles) 

It  is  simplest  to  understand  these  quantities  if  one 
thinks  of  the  network  simply  as  a  communication 
link.  One  can  combine  these  three  parameters  to 
form  a  single  critical  system  parameter,  common¬ 
ly  denoted  as  a,  which  is  defined  as: 

$$a = \frac{5LC}{b} \qquad (1)$$

This  parameter  is  the  ratio  of  the  latency  of 
the  channel  (i.e.,  the  time  it  takes  energy  to  move 
from  one  end  of  the  link  to  the  other)  to  the  time 
it takes to pump one packet into the link. It measures
how many packets can be pumped into one end
of the link before the first bit appears at the
other end [12]. The factor 5 appearing in the
equation is simply the approximate number of
microseconds it takes light to move one mile.
Now, if we calculate this ratio for some common data
networks, we find the values shown in Table 1.
Note the enormous range for the parameter a. At
one  extreme,  namely,  local  area  networks,  it  is  as 
small  as  0.05,  while  at  the  other  extreme,  namely, 
a  cross-country  gigabit  fiber  optic  link,  it  is  as 
large  as  15,000.  This  is  a  range  of  nearly  six  orders 
of  magnitude  for  this  single  parameter! 
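
As a worked illustration of this parameter, the sketch below computes a as the ratio of link latency to packet transmission time, using 5 microseconds per mile and the fact that 1 Mb/s equals one bit per microsecond. Table 1 is not reproduced in this text, so the two input sets are assumed values chosen only to land on the extremes quoted above (0.05 and 15,000); the function name `latency_ratio` is hypothetical.

```python
MICROSECONDS_PER_MILE = 5.0  # approximate propagation time used in the text


def latency_ratio(capacity_mbps, packet_bits, length_miles):
    """Ratio of channel latency to the time needed to pump one packet in.

    capacity_mbps: link capacity C in Mb/s (equivalently bits per microsecond).
    packet_bits:   packet length b in bits.
    length_miles:  link length L in miles.
    """
    latency_us = MICROSECONDS_PER_MILE * length_miles
    packet_time_us = packet_bits / capacity_mbps
    return latency_us / packet_time_us


if __name__ == "__main__":
    # Assumed example inputs (not taken from Table 1):
    # a 10 Mb/s local area network, 1 mile long, 1000-bit packets
    print(latency_ratio(capacity_mbps=10, packet_bits=1000, length_miles=1))
    # a 1 Gb/s cross-country fiber link, 3000 miles long, 1000-bit packets
    print(latency_ratio(capacity_mbps=1000, packet_bits=1000, length_miles=3000))
```

Under these assumed inputs the first call gives a value of about 0.05 and the second about 15,000, matching the two extremes named in the text.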


■ Figure 2. Sending a 1-Mb file across the U.S. via an X.25 network.


■  Figure  3.  Sending  a  1-Mb  file  across  the  U.S.  via  a  T1  channel 


■  Figure  4.  Sending  a  1-Mb  file  across  the  U.S.  via 
a  1.2-Gb  link. 


We see that a grows dramatically when we
introduce gigabit links. So we naturally must ask ourselves
if networks made out of gigabit links are
different in some fundamental way from those made
out of kilobit or megabit links. There are two
cases  of  interest  to  consider.  First,  we  have  the 
case  that  a  large  number  of  users  are  each  shar¬ 
ing  a  small  piece  of  this  large  bandwidth.  In  this 
case  it  is  fairly  clear  that  to  each  of  them,  a  giga¬ 
bit  network  looks  no  different  from  today’s  networks. 

However, if we have a few users each sending
packets and files at gigabit speeds, then we do
see a change in behavior and we do run into new
problems. At these speeds, a gets very large. To
see the effect of this change, let us consider the
following scenario. Assume we are sitting at a terminal
and wish to send one megabit across the
United States to some remote computer as shown
in Fig. 1.

■ Figure 6. Response time vs. system load with a 15 ms propagation delay.

Now,  if  the  speed  of  the  communication  chan¬ 
nel  we  have  available  is  64  kb/s  (as  in,  say,  an 
X.25  packet  network),  then,  as  shown  in  Fig.  2, 
the  fiik  bit  of  this  transmission  will  arrive  at  the  East 
Coast  computer  after  approximately  1000  bits 
have  been  pumped  into  the  channel.  Thus  we  see 
that  the  channel  is  buffering  roughly  0.001  of  the 
message;  that  is,  there  is  1000  times  as  much  data 
stored  in  the  terminal’s  buffer  as  there  is  in  the  chan¬ 


nel.  Clearly,  if  we  had  a  higher-speed  channel, 
the time to transmit our 1 Mb file could be reduced.
That  is,  we  can  benefit  from  more  bandwidth. 

Thus,  let  us  now  increase  the  speed  of  the 
channel and use a T1 channel (1.544 Mb/s). In
Fig.  3  we  show  this  new  configuration.  Now  we 
find  that  the  terminal  is  buffering  roughly  40 
times  as  much  data  as  is  the  channel.  Once  again, 
we  see  that  we  can  benefit  from  more  bandwidth. 

Let  us  now  increase  the  channel  speed  to  a 
gigabit  channel;  in  particular,  we  will  assume  a 
1.2 Gb/s link (the OC-24 SONET offering). This case
is shown in Fig. 4 where we see the entire 1 Mb
file as a small pulse moving down the channel. Indeed,
the pulse occupies roughly only 0.05 of the channel
“buffer.” It is now clear that more bandwidth
is  of  no  use  at  all  in  speeding  up  the  transmission 
of  the  file;  it  is  the  latency  of  the  channel  that 
dominates  the  time  to  deliver  the  file! 
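The three cases above can be checked with a few lines of arithmetic. The sketch below assumes a one-way cross-country propagation delay of 15 ms (the figure used elsewhere in this article) and takes a to be the ratio of the bits held in the channel to the bits in the message; the function and constant names are illustrative only.

```python
# Sketch: how much of a 1-Mb file the channel itself "buffers" at each speed.
PROPAGATION_DELAY_S = 0.015   # assumed one-way coast-to-coast latency (15 ms)
FILE_BITS = 1_000_000         # the 1-Mb file from the text


def latency_ratio(capacity_bps, message_bits=FILE_BITS, delay_s=PROPAGATION_DELAY_S):
    """Return a = (bits held by the channel) / (bits in the message)."""
    return capacity_bps * delay_s / message_bits


for name, capacity in [("X.25, 64 kb/s", 64e3),
                       ("T1, 1.544 Mb/s", 1.544e6),
                       ("gigabit, 1.2 Gb/s", 1.2e9)]:
    a = latency_ratio(capacity)
    print(f"{name:>18}: a = {a:.5f}  (terminal holds {1 / a:.2f}x the channel)")
```

Under these assumptions the channel holds about 0.001 of the file at 64 kb/s, the terminal holds roughly 40 times as much as the channel at T1 speed, and at 1.2 Gb/s the whole file occupies only about 0.05 of the channel.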

Therein  lies  the  fundamental  change  that 
comes  about  with  the  introduction  of  gigabit  links 
into  nationwide  networks.  Specifically,  we  have  passed 
from the regime (of pre-gigabit networking) in which
we  were  capacity  limited,  to  the  new  regime  of 
being  latency  limited  in  the  post-gigabit  world.  Things 
do  indeed  change  (as  we  shall  see  below).  The  speed 
of  light  is  the  fundamental  limitation  for  file 
transfer  in  this  regime!  And  the  speed  of  light  is  a 
constant  of  nature  which  we  have  not  yet  been 
able  to  change! 

In  the  considerations  above,  we  assumed  that  our 
file  was  the  only  traffic  on  the  link.  Let  us  now 
consider  the  case  of  competing  traffic  with  small¬ 
er  packets.  Indeed,  let  us  now  assume  that  we 
have  the  classical  queueing  model  of  a  Poisson  stream 
of  arriving  messages  requesting  transmission  over 
a  communication  link,  where  each  message  has  a 
length  which  is  exponentially  distributed  with  a  mean 
of 128 bytes (i.e., a classic M/M/1 queueing system)
[13]. If, as usual, we let $\rho$ denote the system
utilization factor, then $\rho = \lambda(1024/C)$ where $\lambda$ is
the arrival rate (messages per microsecond) and
$C$ is the channel capacity (Mb/s). In this situation,
we know that $T$, the mean response time (milliseconds)
of the system (i.e., the mean time from
when the message arrives at the tail of the transmit
queue until the last bit of the message appears
at the output of the channel, including any propagation
delay), is given by

$$T = \frac{1.024/C}{1-\rho} + \tau \qquad (2)$$

where $\tau$ is the propagation delay (i.e., the channel
latency) in milliseconds.

Let us ask ourselves if gigabit channels actually
help in reducing the mean response time, $T$. In
Fig. 5, we show the mean response time (in milliseconds)
versus the system load $\rho$ for three different channel
speeds. In this figure, we assume that the
speed of light is infinite, and so $\tau = 0$. The channel
speeds we choose are the same as those considered
above, namely 64 kb/s, 1.544 Mb/s and 1.2
Gb/s. We note a significant reduction in $T$ when
we increase the speed from 64 kb/s to 1.544 Mb/s;
thus, the faster T1 channel helps. However, note that
when we go from 1.544 Mb/s to 1.2 Gb/s, we see almost
no improvement. (The only region in which there
is an improvement with gigabits is at extremely
high loads, a situation to be avoided for other
reasons.) As far as response time is concerned,




IEEE  Communications  Magazine  •  April  1992 


gigabits  do  not  help  here! 

One  might  argue  that  the  assumption  of  zero 
propagation  delay  has  biased  our  conclusions. 
Not so: in Fig. 6 we show the case with a 15-ms
propagation delay (i.e., the propagation delay across
the USA), and we see again that gigabits do not help.
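The comparison behind Figs. 5 and 6 can be reproduced numerically from Eq. (2). The following is a minimal sketch, not the authors' code: it assumes the 128-byte (1024-bit) mean message length stated above, and the function name, the dictionary of channels, and the choice of a 50% load are illustrative.

```python
def mean_response_ms(load, capacity_mbps, prop_delay_ms=0.0, mean_message_bits=1024):
    """Mean M/M/1 response time in ms, per Eq. (2), at utilization `load`."""
    if not 0.0 <= load < 1.0:
        raise ValueError("load must satisfy 0 <= load < 1")
    # Transmission time in ms; equals 1.024/C for 1024-bit messages, C in Mb/s.
    transmit_ms = mean_message_bits / (1000.0 * capacity_mbps)
    return transmit_ms / (1.0 - load) + prop_delay_ms


CHANNELS_MBPS = {"64 kb/s": 0.064, "1.544 Mb/s": 1.544, "1.2 Gb/s": 1200.0}

for tau in (0.0, 15.0):            # infinite speed of light, then cross-country
    for name, c in CHANNELS_MBPS.items():
        t = mean_response_ms(0.5, c, prop_delay_ms=tau)
        print(f"tau = {tau:4.1f} ms  {name:>10}: T = {t:8.4f} ms")
```

At a load of 0.5, moving from 64 kb/s to T1 removes roughly 30 ms of response time, whereas moving from T1 to 1.2 Gb/s removes only a little over 1 ms, with or without the 15-ms propagation delay.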

We  can  sharpen  our  treatment  of  this  latency- 
versus-bandwidth  discussion  as  follows.  Let  us  assume 
that  we  have  an  M/M/1  model  as  above,  where 
the  messages  have  an  average  length  equal  to  b 
bits. Assume we wish to transmit these files
across the United States, as in the earlier figures.
Now, as can be seen from Eq. (2), there are two components
making up the response time, namely,
the queueing-plus-transmission time delay (the first
term in the equation) and the propagation delay ($\tau$).
In this paper, we have been discussing the relative
size of each of these and we referred to
regions of bandwidth-limited and latency-limited
systems.  Let  us  now  make  those  concepts  more 
precise.  We  choose  to  define  a  sharp  boundary 
between  these  two  regions.  In  particular,  we 
define  this  boundary  to  be  the  place  where  the 
two  terms  in  our  equation  are  exactly  equal, 
namely,  where  the  propagation  delay  equals  the 
queueing-plus-transmission  time  delay.  From  Eq. 
(2)  we  see  that  this  occurs  when  the  bandwidth  of 
the  channel  takes  on  the  following  critical  value. 


$$C_{\mathrm{crit}} = \frac{1000\,b}{(1-\rho)\,\tau} \qquad (3)$$


In Fig. 7, we plot this critical value of bandwidth
(on a log scale) versus the system load $\rho$;
we have drawn this plot for the case of $\tau = 15$ ms
and  a  message  length  of  one  megabit.  Above  this 
boundary,  the  system  is  latency  limited,  which  means 
that  more  bandwidth  will  have  negligible  effect  in 
reducing  the  mean  response  time,  T.  Below  this 
boundary,  the  system  is  bandwidth  limited  which 
means  that  it  can  take  advantage  of  more  bandwidth 
to  reduce  T.  Note  that  for  these  parameters  the 
system  is  latency  limited  over  most  of  the  load  range 
when  a  gigabit  channel  is  used;  this  means  that 
for  these  parameters,  a  gigabit  channel  is  overkill 
so  far  as  reducing  delay  is  concerned. 
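The boundary of Eq. (3) is easy to evaluate directly. The sketch below assumes that b is given in bits and the propagation delay in milliseconds, so that the critical bandwidth comes out in bits per second; this unit convention and the function names are assumptions made for illustration.

```python
def critical_bandwidth_bps(message_bits, load, prop_delay_ms):
    """Bandwidth (b/s) at which queueing-plus-transmission delay equals latency."""
    if not 0.0 <= load < 1.0:
        raise ValueError("load must satisfy 0 <= load < 1")
    return 1000.0 * message_bits / ((1.0 - load) * prop_delay_ms)


def is_latency_limited(capacity_bps, message_bits, load, prop_delay_ms):
    """True when the channel is faster than the critical bandwidth."""
    return capacity_bps > critical_bandwidth_bps(message_bits, load, prop_delay_ms)


# 1-Mb messages, 15-ms cross-country delay, 1.2 Gb/s channel.
for load in (0.0, 0.5, 0.9, 0.95):
    crit = critical_bandwidth_bps(1e6, load, 15.0)
    print(f"load = {load:4.2f}: critical bandwidth = {crit / 1e6:8.1f} Mb/s, "
          f"latency limited at 1.2 Gb/s: {is_latency_limited(1.2e9, 1e6, load, 15.0)}")
```

With these parameters the critical bandwidth is about 67 Mb/s at zero load and does not exceed 1.2 Gb/s until the load is above roughly 0.94.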

We  repeat  this  plot  in  Fig.  8  for  a  number  of 
different  message  sizes.  Without  labeling  the  regions, 
the  same  comments  apply,  namely,  systems  above 
the  curve  are  latency  limited,  and  below  they  are 
bandwidth  limited.  We  note  that  gigabit  channels 
begin to make sense for message sizes of 10
megabits or more, but are not helpful for smaller file
sizes. This comment about message size refers to the
file size that the user application generates; the
fact that ATM uses 53-byte cells has little to do
with  this  comment. 

Figures 7 and 8 apply to the case of a cross-country
link (i.e., with a propagation delay of roughly
15 ms). For $\tau$ other than 15 ms, the critical bandwidth
which defines the boundary is given by
Eq. (3).


Other  Issues 

We have dealt with the latency-bandwidth tradeoff
for gigabit networks in this paper. Of course
there  are  a  number  of  other  issues  to  be  addressed 
in  gigabit  nets,  some  of  which  we  choose  to  men¬ 
tion  in  this  section. 

Consider the example from the previous section,


■ Figure 7. Bandwidth vs. system load for a 1-Mb file sent across the U.S.


namely,  a  gigabit  link  spanning  the  United  States. 
Suppose we start transmitting a file at time t = 0.
Roughly 15 ms later, the first bit will appear
across the country. Now suppose that the receiving
process decides immediately that it cannot accept
this new flow which has begun. By the time the
first bit arrives, however, there are roughly 15
million bits already in the pipe heading toward
this receiving process! And, by the time a stop
signal reaches the source, another 15 million bits will
have been launched! It does not take too much imagination
to see that we have a problem here. It is basically
a congestion control and flow control
problem. Clearly, a closed control feedback method
of  flow  control  is  too  sluggish  in  this  environment 
(due,  once  again,  to  latency).  Some  other  forms 
of  control  must  be  incorporated.  For  example, 
one  could  use  rate-based  flow  control  in  which 
the  user  is  permitted  to  transmit  at  a  maximum  allow¬ 
able  rate. 
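The size of the problem follows from the product of link speed and delay. A minimal sketch, assuming a 1 Gb/s link and a 15-ms one-way delay as in the scenario above (function names are illustrative):

```python
def bits_in_pipe(capacity_bps, one_way_delay_s):
    """Bits already launched when the first bit reaches the receiver."""
    return capacity_bps * one_way_delay_s


def bits_sent_before_stop(capacity_bps, one_way_delay_s):
    """Bits launched by the time a stop signal, issued when the first bit
    arrives, has travelled back to the source (one full round trip)."""
    return 2 * bits_in_pipe(capacity_bps, one_way_delay_s)


GIGABIT_BPS = 1e9   # 1 Gb/s link
DELAY_S = 0.015     # 15 ms one-way latency

print(f"in the pipe at first arrival:   {bits_in_pipe(GIGABIT_BPS, DELAY_S):,.0f} bits")
print(f"launched before the stop lands: {bits_sent_before_stop(GIGABIT_BPS, DELAY_S):,.0f} bits")
```

A receiver relying on such a stop signal would therefore have to absorb or discard on the order of 30 million bits, which is why feedback alone is too sluggish here.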

Moreover,  at  the  application  level,  it  is  impor¬ 
tant  to  find  ways  to  hide  this  latency,  in  order  to 
get  full  advantage  of  the  gigabit  links  and  of  the  high 
performance  processors  attached  to  a  gigabit  net¬ 
work.  One  way  to  hide  latency  is  to  use  some 
form  of  parallelism  (or  pipelining)  such  that 
while  one  process  is  waiting  for  a  response, 
another  process,  which  does  not  depend  upon 
this  response,  may  proceed  with  its  processing. 

Another issue has to do with the maximum attainable
efficiency that one can obtain by taking
advantage of statistical multiplexing of bursty sources
in a gigabit environment. If we have a large number
of small bursty sources, then statistical multiplexing
takes exquisite advantage of the Law of Large
Numbers [13], and allows one to drive these
channels at very high efficiencies. However, if we
have a small number of large sources, then the
multiplexing does not usually lead to very high
efficiencies. This is because statistical smoothing
of a small number of sources is not sufficient to bring
about the advantages of statistical multiplexing. Furthermore,
if we have a large number of nonhomogeneous
sources, one must calculate the effective
number of such sources in order to calculate the
efficiency to be expected from multiplexing [14].

Conclusions 

The major conclusion of this paper is to recognize
that gigabit networks have forced us to
deal with the propagation delay due to the finite
speed of light. Fifteen milliseconds to cross the United
States is an eternity when we are talking about
gigabit links and microsecond transmission times.
As we saw earlier, the propagation delay across
the USA is forty times smaller than the time required
to transmit a 1-Mb file into a T1 link. At a gigabit,
the situation is completely reversed, and now the
propagation delay is 15 times larger than the time
to transmit into the link. We have moved into a
new domain in which the considerations are completely
reversed. We must rethink a number of issues.
For  example,  the  user  must  pay  attention  to  his 
file  sizes  and  how  latency  will  affect  his  applications. 
The  user  must  try  to  hide  the  latency  with  pipelin¬ 
ing  and  parallelism.  Moreover,  the  system  design¬ 
er  must  think  about  the  problems  of  flow  control, 
buffering,  and  congestion  control.  Some  form  of  rate- 
based  flow  control  will  help  the  designer  here.  He 
must  also  design  algorithms  which  make  rapid 
decisions  if  enormous  buffer  requirements  are  to 
be avoided. The designer cannot depend on global
state information being available in a timely


fashion:  this  affects  his  choice  of  control  algorithms. 
In  many  ways,  the  user  will  see  gigabit  networks 
as  being  different  from  megabit  networks;  the 
same is true for the designer/implementer.

Much more research must be done before we can
claim  to  have  solved  many  of  the  problems  that 
this  new  environment  has  exposed.  We  must 
solve  these  problems  in  the  near  future  if  we  are 
to  enjoy  the  benefits  that  fiber  optics  has  given  us 
in  the  form  of  enormous  bandwidths. 
Abstract

In this paper a multiple access protocol is proposed for a system consisting
of many high-speed bursty traffic stations interconnected via an optical passive star
coupler. Each user has access over a range of wavelengths, thus resulting in
a wavelength division multiplexed communication. The performance of both
the infinite and finite population cases has been modeled and analyzed. Numerical
results show that low delay and high throughput (larger than the
electronic speed of a single user) can be achieved. The analysis also shows
that  the  best  performance  is  obtained  when  the  capacities  of  the  reservation 
channels  and  the  data  channels  are  balanced. 

Keywords:  Wavelength  Division  Multiplexing,  Multiaccess  Protocol,  Local 
Area Networks, Performance Analysis, Fiber Optics.
1 Introduction


The  explosive  advance  of  fiber  optics  technology  in  the  past  decade  offers  a  com¬ 
bination  of  wide  bandwidth  and  low  attenuation  unmatched  by  any  other  trans¬ 
mission  medium.  In  high-capacity  Local  Area  Networks  (LANs),  where  distance  is 
short  and  propagation  loss  is  of  little  concern,  it  is  the  availability  of  high  band¬ 
width  that  makes  fiber  so  attractive.  It  is  conceivable  that  by  using  the  low-loss 
passband of optical fibers (1200-1600 nm), we could construct multiple access networks
carrying a total traffic of around 50 terabits per second [1]. An obstacle to
realizing  such  high-speed  transmission  of  optical  signals  is  the  bottleneck  at  the 
electronic  interface.  A  fundamental  limitation  of  single-channel  high-speed  net¬ 
works  such  as  Expressnet  [2],  FDDI  [3],  and  DQDB  [4]  is  that  in  these  networks, 
the  maximum  throughput  of  the  entire  network  is  limited  to  the  rate  that  can 
be supported by the electronics of one of the end user stations. Wavelength Division
Multiple Access (WDMA) eliminates this bottleneck by operating on multiple
channels  at  different  wavelengths,  with  each  channel  running  at  a  moderate  speed. 
However,  the  control  of  the  WDMA  system  has  proven  to  be  a  major  obstacle  to 
turning the vast link capacity into system-wide, user-accessible capacity; it is necessary
to develop efficient medium access techniques for packet communications
in  this  environment. 

Today’s  electronically  tunable  semiconductor  lasers  and  filters  can  tune  from 
one  wavelength  to  another  in  a  few  nanoseconds.  However,  the  tuning  range  is  lim¬ 
ited.  Therefore,  each  node  can  only  operate  on  a  small  number  of  wavelengths  [5]. 
One class of WDMA networks can be constructed by the use of fixed (wavelength)
transmitters and fixed (wavelength) receivers. These networks employ multi-hop
topologies in which a packet may be routed through several intermediate nodes
before it is delivered to its destination. Examples of this type of network can be
found  in  [6,  7].  The  second  class  of  WDMA  networks  assumes  single-hop  com¬ 
munications  [8-10]  where  multiple  channels  are  created  by  employing  tunable 
transmitters  and/or  receivers.  In  [8],  a  station  is  equipped  with  multiple  fixed 
transmitters  and  multiple  tunable  receivers.  Various  architectural  alternatives  for 
WDMA LANs are discussed. In [9], a single tunable transmitter with limited tunability
and multiple fixed receivers are provided to each station. A random time
division  multiple  access  (TDMA)  protocol  is  used  to  determine  which  wavelength 
a  station  is  allowed  to  transmit  on.  In  [10],  tunable  filters  (i.e.,  receivers)  capable 
of  rapid  tuning  over  a  large  number  of  channels  are  assumed  in  order  to  support 
packet  switching.  Also,  large  packets  have  to  be  used  since  the  slot  (and  thus  the 
packet)  size  is  proportional  to  the  number  of  stations  in  the  network. 

The WDMA local area network can be realized using a bus or a star topology.
A star topology is preferred [1] because the station-to-station link attenuation
grows linearly with the number of stations. In a bus topology, the excess loss
grows quadratically with the number of stations.

The  system  under  consideration  in  this  paper  is  a  passive  star  network  as  shown 
in Figure 1. There are $(W + 1)$ wavelengths available, $\lambda_0, \lambda_1, \ldots, \lambda_W$, to serve $N$
attached stations. The channel at wavelength $\lambda_0$ serves as the control channel for
the exchange of the control traffic, while the other $W$ channels are for actual data
traffic. Each station is equipped with two lasers: one fixed laser tuned at $\lambda_0$ and
the other laser tunable to any of the wavelengths $\lambda_1, \ldots, \lambda_W$. The output of the
two lasers is coupled into a 2 x 1 combiner, the output of which is connected to
one of the inputs of the N x N star coupler. Signals transmitted at all of the
$(W + 1)$ wavelengths are combined at the star coupler and distributed to all of
the stations. Each station also has two receivers: one fixed filter tuned at $\lambda_0$ and
the other tunable to any of the wavelengths $\lambda_1, \ldots, \lambda_W$. At the receiver, the input
optical signal is split into two parts by means of a 1 x 2 splitter. One part goes
to the fixed optical filter which passes only the control wavelength $\lambda_0$, and the
other  output  goes  to  the  tunable  filter  which  is  tuned  to  pass  the  desired  data 
wavelength. 

In this paper, a multiaccess protocol based on reservation-ALOHA [11] is proposed
and analyzed. In Section 2, we describe the details of the protocol. Section 3
presents  the  analysis  of  mathematical  models  of  both  the  infinite  and  the  finite 
population  cases.  In  Section  4,  numerical  results  from  both  analysis  and  simulation 
are given. Section 5 concludes the paper.


2  Description  of  Protocol 

2.1  The  Protocol 

We  assume  the  existence  of  a  common  clock,  obtained  either  by  distributing  a  clock 
to  all  the  stations  or  by  means  of  some  self-clocking  mechanism  inherent  in  the 
data.  The  problem  of  generating  the  global  clock  is  addressed  and  solved  in  [12]. 
Time  is  divided  into  slots.  Packets  are  of  fixed  length,  which  is  equal  to  one  slot. 
The  propagation  delay  from  any  station  to  the  star  coupler  and  then  to  any  other 
station  is  assumed  to  be  equal  to  R  slots.  Slots  on  the  data  channels  are  called 
data  slots  and  contain  the  actual  data  packets.  Slots  on  the  control  channel  are 
called  control  slots  because  they  carry  only  control  information  about  the  packets 
and  the  transmitters.  Each  control  slot  consists  of  a  reservation  subpart  and  a 
tuning  subpart.  The  reservation  subpart  is  divided  into  V  minislots  to  be  used 
on a contention basis with the slotted ALOHA protocol, and the tuning subpart
is divided into W minislots to convey the wavelength tuning information. The
structure  of  a  control  slot  is  shown  in  Figure  2. 

A  station  generating  a  packet  will  randomly  select  one  of  the  V  reservation 
minislots  in  the  next  control  slot  and  transmit  a  reservation  minipacket  on  the 
control  channel.  R  slots  later  the  station  will  hear  the  result  of  its  reservation.  If 
it  is  successful,  it  is  received  by  all  the  stations  because  of  the  broadcast  nature  of 
the control channel. All successful reservations join a common distributed queue of
stations waiting to transmit. If there is a collision, the station will transmit another
reservation minipacket in the next slot with probability p, and with probability
(1  -  p)  it  will  defer  the  decision  one  slot  and  transmit  the  reservation  in  this 
next  slot  with  probability  p,  etc. 
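The reservation step described above can be sketched as a small simulation of one control slot. This is an illustrative sketch rather than the authors' implementation: the function names are hypothetical, and the rule that successful stations join the queue in minislot order is an assumption, since the text does not state the ordering explicitly.

```python
import random
from collections import Counter


def reserve(stations, num_minislots, rng):
    """One control slot: every contending station picks a reservation minislot.

    Returns (winners, losers). A reservation succeeds only if no other
    station picked the same minislot (slotted ALOHA over V minislots).
    """
    picks = {station: rng.randrange(num_minislots) for station in stations}
    occupancy = Counter(picks.values())
    # Assumption: successful stations join the distributed queue in minislot order.
    winners = sorted((s for s in stations if occupancy[picks[s]] == 1),
                     key=lambda s: picks[s])
    losers = [s for s in stations if occupancy[picks[s]] > 1]
    return winners, losers


def retry_now(losers, p, rng):
    """Each collided station retransmits in the next slot with probability p;
    the others defer and repeat the same coin toss one slot later."""
    return [s for s in losers if rng.random() < p]


rng = random.Random(0)   # fixed seed so that a run is repeatable
winners, losers = reserve([1, 2, 3, 4, 5], num_minislots=5, rng=rng)
print("join the distributed queue:", winners)
print("collided:", losers)
print("retry in the next slot:", retry_now(losers, p=0.5, rng=rng))
```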

In  the  tuning  subpart  of  each  control  slot,  each  of  the  first  W  stations  in  the 
distributed  queue  will  transmit  a  tuning  minipacket  in  an  assigned  tuning  minislot; 
the  minipacket  contains  the  destination  address  and  other  relevant  information 


Figure 1: N stations connected by a passive star coupler. Each station has a tunable transmitter, a fixed transmitter tuned at $\lambda_0$, a tunable receiver, and a fixed receiver tuned at $\lambda_0$.

Figure 2: Structure of a control slot

for the data packet to be transmitted in the next slot. In particular, the first
station in the queue transmits its minipacket in the first tuning minislot, the
second station in the queue transmits in the second minislot, . . . , and the Wth
station transmits in the Wth minislot. The position of the tuning minislot uniquely
determines the wavelength to use. At the beginning of the next slot, those W
stations tune their tunable transmitters to their assigned wavelengths, with the ith
station in the queue using wavelength $\lambda_i$ to transmit its data packet. When there
are fewer than W stations in the distributed queue, some data wavelengths will be
unused. When the destination sees its address announced in a tuning minislot on
the control channel, it tunes its tunable receiver to the corresponding wavelength
and receives the data packet at the beginning of the next slot. If two or more
packets are addressed to the same destination in a slot, we arbitrarily select the
one  transmitted  on  the  lower  wavelength  number  to  win  the  competition  (the 
arbitration  can  also  be  made  by  the  use  of  relevant  information  carried  in  the 
tuning  minipacket,  such  as  the  packet  age  or  priority).  The  losing  stations  must 
start  over  with  the  reservation  procedure  again. 

Thus,  a  station  desiring  to  send  a  packet  must  first  compete  on  the  ALOHA 
reservation  subchannel  to  gain  access  to  a  minislot  on  the  tuning  subchannel. 
The  station  then  informs  its  intended  receiver  to  listen  (i.e.,  tune)  to  a  particular 
wavelength  on  which  the  data  packet  will  be  transmitted.  If  a  given  receiver  is 
informed by more than one station, only one station will be selected, and the
others  must  repeat  the  entire  procedure. 
2.2 An Illustrative Example


Here  we  give  a  simple  example  where  we  assume  N=10,  V=5,  W=2,  and  R=2. 
Let  slot  t  denote  the  slot  from  time  t  to  time  t  +  1.  Define  (s,d)  as  a  packet, 
where  s  denotes  the  source  node  and  d  the  destination.  At  time  0,  five  packets, 
(1,7),  (2,6),  (3,9),  (4,8),  (5,8),  are  generated.  In  slot  0,  five  reservation  packets 
are transmitted with nodes 2 and 3 transmitting in the same minislot. At time 3,
nodes 2 and 3 find out that they are involved in a collision, while nodes 4, 5, and
1 realize the success of their reservations and join the distributed queue (in the
order 4, 5, 1). Define $C_t$ as the length of the distributed queue at time t. At this
time the queue length $C_3$ equals three. In slot 3, node 2 tosses a coin and decides
to transmit a reservation minipacket again, while node 3 tosses a coin and decides
to defer the decision by one slot. Meanwhile, in the tuning subpart of slot 3, nodes
4 and 5 write their destination addresses (both are 8) into tuning minislots 1 and
2, and, in slot 4, transmit their actual data packets on wavelengths $\lambda_1$ and $\lambda_2$,
respectively. In slot 5, node 8 finds out from the control channel that two packets
are coming for it and tunes its tunable receiver to $\lambda_1$ (the lower wavelength) to
receive the data packet from node 4. At the same time node 5 realizes that it lost
the competition (because of the broadcast nature of the control channel). It tosses
a coin, then decides to transmit a reservation minipacket in slot 6 and restarts its
reservation  procedure  again. 
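The tuning and arbitration step of this example can be replayed in a few lines. The sketch below is illustrative only: `schedule_data_slot` is a hypothetical helper, and it implements just the default rule stated earlier (the packet on the lower wavelength number wins), not arbitration by packet age or priority.

```python
def schedule_data_slot(queue, num_wavelengths):
    """Serve the first W entries of the distributed queue.

    queue: list of (source, destination) pairs in queue order.
    Returns (delivered, losers, remaining), where `delivered` maps each
    destination to the (wavelength, source) it tunes to, `losers` lists the
    sources that must restart the reservation procedure, and `remaining`
    is the queue left for later slots.
    """
    head, remaining = queue[:num_wavelengths], queue[num_wavelengths:]
    delivered = {}
    losers = []
    # The i-th station in the queue transmits on wavelength i (1-based).
    for wavelength, (source, destination) in enumerate(head, start=1):
        if destination in delivered:
            losers.append(source)      # a lower wavelength already won
        else:
            delivered[destination] = (wavelength, source)
    return delivered, losers, remaining


# State at time 3 of the example: queue order 4, 5, 1 with W = 2.
queue = [(4, 8), (5, 8), (1, 7)]
delivered, losers, remaining = schedule_data_slot(queue, num_wavelengths=2)
print("receivers tune to:", delivered)   # node 8 listens on wavelength 1 to node 4
print("must reserve again:", losers)     # node 5 lost the competition
print("still queued:", remaining)        # node 1 waits for the next slot
```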

3  Performance  Analysis 

3.1  Model  Assumptions 

We assume that there are (W+1) wavelengths available and that the number of
stations in the network is N. Each station has a single buffer which is equal to
the size of a packet. A new packet arrives at a station with an empty buffer with
probability $\sigma$ at the end of a slot. A packet generated by a station is addressed to
any of the other (N-1) stations with equal probability. A source station with a full


Figure 3: A scenario of packet transmissions (N=10, V=5, W=2, R=2). C: the length of the distributed queue.

buffer (i.e., one packet) will not discard its packet until its successful reception at
its  destination  is  recognized. 


3.2  The  Infinite  Population  Case 

In  this  subsection  we  consider  the  infinite  population  case  where  the  number  of 
stations,  N,  is  infinitely  large.  We  assume  that  the  total  reservation  traffic  offered 
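to the $V$ ALOHA channels forms a Poisson process with rate $G$ requests per slot.
Thus the traffic offered to one ALOHA channel is $G/V$ requests per slot. Specifically,
$G = \lim_{N \to \infty} N\sigma$. The probability that exactly one reservation is transmitted
in a single ALOHA channel is $(G/V)e^{-G/V}$ [13]. Define $A_t$ as the number of successful
requests in slot $t$. Let $A = \lim_{t \to \infty} A_t$. The probability mass function (pmf) and
z-transform of $A$ are

$$a_j \triangleq \mathrm{Prob}[A = j] = \binom{V}{j}\left(\frac{G}{V}e^{-G/V}\right)^{j}\left(1 - \frac{G}{V}e^{-G/V}\right)^{V-j}, \qquad j = 0, \ldots, V$$

$$A(z) = \left(1 - \frac{G}{V}e^{-G/V} + z\,\frac{G}{V}e^{-G/V}\right)^{V}$$

The throughput $S$ of the $V$ ALOHA reservation channels is

$$S = E[A] = V \cdot \frac{G}{V}e^{-G/V} = Ge^{-G/V} \qquad (1)$$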

with  the  maximum  throughput,  V/e,  occurring  at  G  =  V.  Therefore,  the  capacity 
of  the  system  will  be  the  minimum  of  the  reservation  channel  throughput  and  the 
data  channel  throughput,  namely,  min(V/e,  W). 
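Eq. (1) and the capacity bound can be evaluated directly. A minimal sketch, with illustrative function names and parameter values:

```python
import math


def reservation_throughput(G, V):
    """Successful reservations per slot over V ALOHA minislots, per Eq. (1)."""
    return G * math.exp(-G / V)


def system_capacity(V, W):
    """Upper bound on packets per slot: min(V/e, W)."""
    return min(V / math.e, W)


V, W = 8, 4
for G in (2.0, 4.0, 8.0, 12.0):
    print(f"G = {G:4.1f}: reservation throughput = {reservation_throughput(G, V):.3f}")
print(f"capacity with V = {V}, W = {W}: {system_capacity(V, W):.3f} packets per slot")
```

With V = 8 and W = 4 the reservation channel (V/e, about 2.94) is the bottleneck; raising V to 16 would make the W = 4 data channels the limit instead, which illustrates why the two capacities should be balanced.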

The packet delay $D$, defined as the time interval from the packet arrival instant
to the successful reception at the destination, can be computed as follows:

$$D = D_r + D_q + (R + 1)$$

where $D_r$ is the reservation delay defined as the interval between the packet arrival
instant and the moment the success of the reservation is realized. $D_q$ is the
queueing delay, which is the time period from the instant the success of reservation




is recognized until the beginning of the successful transmission of the data packet.
$(R + 1)$ slots account for the transmission (1 slot) and propagation delay ($R$ slots).
Therefore, the average packet delay is

$$E[D] = E[D_r] + E[D_q] + (R + 1) \qquad (2)$$

We compute $E[D_r]$ first. Define

$$q_r = \mathrm{Prob}[\text{a transmitted reservation request is successful}] = e^{-G/V}$$

Therefore, we have

$$r_n = \mathrm{Prob}[\text{reservation succeeds at the } n\text{th trial}] = q_r(1 - q_r)^{n-1}$$

and

$$r = \text{the average number of reservation requests sent per packet} = 1/q_r = e^{G/V}$$

The average time between two consecutive transmissions of the request is $(1/p + R)$;
therefore,

$$E[D_r] = (R + 1) + (r - 1)\left(\frac{1}{p} + R\right) \qquad (3)$$

To compute $E[D_q]$, let $C = \lim_{t \to \infty} C_t$ be the length of the distributed queue at
the end of a slot in steady state, and $X$ be the position of a typical (say tagged)
successful reservation among those successful ones in the same slot. $E[D_q]$ can be
computed as follows:

=  [SM]  (4) 

To get $E[C]$, we first have

$$C_{t+1} = \max(0,\; C_t + A_{t+1} - W)$$

Assume that steady state exists. Solving this using the technique in [14], we get

$$C(z) = \sum_{j=0}^{\infty} c_j z^j = \frac{\sum_{i=0}^{W-1}\sum_{j=0}^{W-i-1} c_i a_j \left(z^{i+j} - z^{W}\right)}{\left(1 - \frac{G}{V}e^{-G/V} + z\,\frac{G}{V}e^{-G/V}\right)^{V} - z^{W}}$$

where $c_j = \mathrm{Prob}(C = j)$ is the pmf of $C$. We denote the denominator of $C(z)$ by
$D(z)$. Using Rouche's theorem [14], it can be shown that $W$ roots of $D(z)$ are on
or inside the unit circle $|z| = 1$. Those roots must cancel out with the roots of the
numerator. Therefore, $C(z)$ becomes

$$C(z) = \frac{B}{(z - z_1)(z - z_2)\cdots(z - z_{V-W})}$$

where $z_1, z_2, \ldots, z_{V-W}$ are the $(V - W)$ roots of $D(z)$ outside the unit circle $|z| = 1$
and $B$ is a constant. The condition $C(1) = 1$ gives us $B = (1 - z_1)\cdots(1 - z_{V-W})$.
Therefore,

$$C(z) = \frac{(1 - z_1)(1 - z_2)\cdots(1 - z_{V-W})}{(z - z_1)(z - z_2)\cdots(z - z_{V-W})}$$

and

$$E[C] = \left.\frac{dC(z)}{dz}\right|_{z=1} = -\sum_{i=1}^{V-W}\frac{1}{1 - z_i} \qquad (5)$$
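As a numerical cross-check of the mean queue length that avoids root finding, one can iterate the recursion for the distributed queue on its probability distribution until it settles. The sketch below is not the method of [14]. It assumes that each of the V minislots independently carries a successful reservation with probability (G/V)e^(-G/V), so that the number of successes per slot is binomial, and it truncates the queue at a hypothetical `max_len`; all names and parameter values are illustrative.

```python
import math


def success_pmf(G, V):
    """pmf of the number of successful reservations per slot (assumed binomial)."""
    q = (G / V) * math.exp(-G / V)   # success probability of one minislot
    return [math.comb(V, j) * q**j * (1 - q)**(V - j) for j in range(V + 1)]


def mean_queue_length(G, V, W, max_len=100, iterations=400):
    """Approximate steady-state mean of C, where C' = max(0, C + A - W)."""
    a = success_pmf(G, V)
    c = [1.0] + [0.0] * max_len      # start from an empty queue
    for _ in range(iterations):
        nxt = [0.0] * (max_len + 1)
        for length, prob in enumerate(c):
            if prob == 0.0:
                continue
            for arrivals, p_arr in enumerate(a):
                new_len = min(max(0, length + arrivals - W), max_len)
                nxt[new_len] += prob * p_arr
        c = nxt
    return sum(k * pk for k, pk in enumerate(c))


for G in (1.0, 1.5, 2.0):
    print(f"G = {G}: mean queue length ~ {mean_queue_length(G, V=8, W=2):.4f}")
```

The iteration is only meaningful when the mean number of successes per slot is below W; otherwise the queue grows without bound and the truncation dominates the result.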


Now we compute $E[X]$. Let $x_j = \mathrm{Prob}(X = j)$ be the pmf of $X$. Define the
random variable $K$ to be the total number of successful reservations in the same
slot where the tagged successful reservation resides. The pmf for $K$ is

$$\mathrm{Prob}(K = k) = \frac{k\,a_k}{E[A]}, \qquad k = 1, \ldots, V$$

Since the tagged reservation can be at any position in a group with equal probability,
we have

$$\mathrm{Prob}(X = j \mid K = k) = \frac{1}{k}, \qquad j = 1, \ldots, k.$$

Unconditioning on $k$, we get

$$x_j = \sum_{k=j}^{V} \frac{1}{k}\cdot\frac{k\,a_k}{E[A]} = \frac{1}{E[A]}\sum_{k=j}^{V} a_k, \qquad j = 1, \ldots, V$$

and

$$E[X] = \sum_{j=1}^{V} j\,x_j = \frac{1}{2} + \frac{E[A^2]}{2E[A]} = 1 + \frac{1}{2}(V - 1)\frac{G}{V}e^{-G/V}$$

where we have evaluated $E[A^2] = Ge^{-G/V}\left(1 + (V - 1)\frac{G}{V}e^{-G/V}\right)$ from the expression for
$A(z)$ given earlier. $E[D_q]$ can now be computed according to Eq. (4). The average
packet delay is finally obtained from Eq. (2).
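The closed form for the mean position of the tagged reservation can be checked against a direct summation over the pmf. This sketch assumes, as above, that the number of successful reservations per slot is binomial with per-minislot success probability (G/V)e^(-G/V); the function name and parameter values are illustrative.

```python
import math


def tagged_position_mean(G, V):
    """Return E[X] computed two ways: by summing j * x_j, and in closed form."""
    q = (G / V) * math.exp(-G / V)
    a = [math.comb(V, k) * q**k * (1 - q)**(V - k) for k in range(V + 1)]
    mean_a = sum(k * a_k for k, a_k in enumerate(a))
    # x_j = (1 / E[A]) * sum_{k >= j} a_k, for j = 1, ..., V
    x = [sum(a[j:]) / mean_a for j in range(1, V + 1)]
    direct = sum(j * x_j for j, x_j in enumerate(x, start=1))
    closed_form = 1 + 0.5 * (V - 1) * q
    return direct, closed_form


direct, closed_form = tagged_position_mean(G=2.0, V=5)
print(f"direct sum = {direct:.6f}, closed form = {closed_form:.6f}")
```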

In  Figure  4,  we  plot  the  ALOHA  channel  capacity  and  the  data  channel  ca¬ 
pacity.  For  a  fixed  value  of  the  reservation  traffic,  we  see  that  the  throughput  is 
the  minimum  of  the  ALOHA  channel  capacity  and  the  number  of  wavelengths. 
Figure  5  shows  the  throughput-delay  curve.  These  curves  give  the  performance  for 
an  infinite  population  of  stations  whose  collective  generation  rate  of  new  packets 
is 5. From Figure 5 we see that the system is not stable (there are two different
values of delay associated with a given throughput) and some dynamic control
procedure (e.g., see [15]) will be required to stabilize the system.


3.3  The  Finite  Population  Case 

In  this  subsection  we  analyze  the  case  when  the  number  of  stations  in  the  sys¬ 
tem  is  finite.  An  approximate  model  of  the  system  is  shown  in  Figure  6.  In  this 
model, each station can be in one of the following $(3R + 3)$ modes at any instant:
TH, RT, Q, $PQ_m$, $PR_m$, and $PS_m$ ($1 \le m \le R$). Each station can move from one
mode  to  another  mode  only  at  the  beginning  of  each  slot. 

Stations in each mode act as follows. Stations in the TH (thinking) mode generate a packet with probability $\sigma$ at the end of a slot. Stations in the Q (queued) mode are currently in the distributed queue. A station that has suffered a collision of its reservation packet and has realized it is said to be in the RT (retransmission) mode and will retransmit the reservation with probability $p$ in the next slot. Stations in the $PQ_m$ mode will move into the $PQ_{m-1}$ mode at the next slot with probability 1. Thus, as can be seen in Figure 6, stations in the $PQ_m$ mode will enter the Q mode after $m$ slots. The meaning of the $PR_m$ and the $PS_m$ modes is similar to that of the $PQ_m$ mode; that is, these three kinds of modes are unit delay elements and express the influence of the channel propagation delay.

Figure 4: Throughput versus offered reservation traffic. $N = \infty$, $W = 4$.

Figure 6: An approximate model of the system.

We now define a state vector of the system. Let $n_{RT}$ be a random variable representing the number of stations in the RT mode, $n_Q$ that in the Q mode, $i_m$ that in the $PR_m$ mode, $j_m$ that in the $PQ_m$ mode, and $k_m$ that in the $PS_m$ mode, $m = 1, \ldots, R$. In the model we will further make a nonpersistence assumption: a station, upon entering the RT mode, will randomly reselect a destination for its packet. (This is not the case in the real system, but later we will see that the model still predicts the performance very well under this assumption.) Define the vector $n = (n_{RT}, n_Q, i_1, \ldots, i_R, j_1, \ldots, j_R, k_1, \ldots, k_R)$ as the state vector of the system. Then we can see that the vector $n$ forms a discrete-time Markov chain with a finite state space.

Unfortunately, since the state space is so large, it is difficult for us to solve this Markov chain. Therefore, we utilize the technique of equilibrium point analysis (EPA) [16] to analyze this chain.

Figure 7: A modified model of Figure 6 in the case of $\sigma < p$.

3.3.1  The  Modified  Model 

For simplicity of analysis, we first consider a modification of the model in Figure 6 as suggested in [16], which combines the two inputs (from the TH mode and the RT mode) of the slotted ALOHA reservation channel into one. Since we have assumed bursty users, we shall confine ourselves to the case $\sigma < p$. The modified model is shown in Figure 7, where the TH mode in Figure 6 has been decomposed into two modes, I and T, and the RT mode in Figure 6 has become part of the T mode. A station that has just moved out from the $PS_1$ mode moves into the I and the T modes with probabilities $(1 - \sigma/p)$ and $\sigma/p$, respectively, instead of entering the TH mode with probability 1. A station in the I mode will move into the T mode at the next slot with probability $\sigma$, and a station in the T mode transmits a reservation minipacket (i.e., moves out from the T mode) with probability $p$. The model in Figure 7 is equivalent to that in Figure 6 from the viewpoint of the stochastic behavior to be explained below; consequently, we can derive any characteristic of the model in Figure 6 by using the model in Figure 7. Thus, there is no performance difference between these two models.

The equivalence of the two models can be interpreted as follows. Let $X_1$, $Y_1$, and $Y_2$ be random variables representing the time (number of slots) during which a station is in the TH, RT, and T modes, respectively. Also let $X_2$ be a random variable representing the time from the moment a user enters either the I mode or the T mode from the $PS_1$ mode until the instant the station moves out of the T mode. (Note that $X_2$ corresponds to $X_1$.) It can be shown [16] that $X_2$ and $Y_2$ have the same pmf's as $X_1$ and $Y_1$, respectively.
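The claim that $X_2$ has the same pmf as $X_1$ can be checked numerically. The sketch below assumes that every sojourn time is geometric on $1, 2, \ldots$ (a station leaves TH or I with probability $\sigma$ per slot and leaves T with probability $p$ per slot); this timing convention and the function names are assumptions made for illustration.

```python
def geometric_pmf(q, n_max):
    """P(T = n) = q (1 - q)^(n - 1) for n = 1..n_max (slots until the first success)."""
    return [q * (1 - q) ** (n - 1) for n in range(1, n_max + 1)]

def x2_pmf(sigma, p, n_max):
    """pmf of X2 in the modified model.

    With probability sigma/p the station enters T directly; otherwise it
    first stays in I (geometric with sigma) and then in T (geometric with p).
    """
    g_sigma = geometric_pmf(sigma, n_max)
    g_p = geometric_pmf(p, n_max)
    pmf = []
    for n in range(1, n_max + 1):
        direct = (sigma / p) * g_p[n - 1]
        # a slots in I followed by n - a slots in T, for a = 1..n-1
        via_i = sum(g_sigma[a - 1] * g_p[n - a - 1] for a in range(1, n))
        pmf.append(direct + (1 - sigma / p) * via_i)
    return pmf

sigma, p, n_max = 0.01, 0.2, 200    # sigma < p, as in the numerical section
x1 = geometric_pmf(sigma, n_max)    # assumed pmf of the time spent in TH
x2 = x2_pmf(sigma, p, n_max)
max_gap = max(abs(u - v) for u, v in zip(x1, x2))
```

Under these assumptions the two pmf's coincide up to rounding, which is why splitting TH into I and T with the branching probabilities $(1 - \sigma/p)$ and $\sigma/p$ leaves the behavior unchanged.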

In the modified model in Figure 7, we now let $n_T$ be a random variable representing the number of stations in the T mode. Then, it is apparent that the modified state vector $n = (n_T, n_Q, i_1, \ldots, i_R, j_1, \ldots, j_R, k_1, \ldots, k_R)$ is also a Markov chain under the nonpersistence assumption.

3.3.2  The  Equilibrium  Point  Equation 

An  equilibrium  point  is  a  point  in  state  space  such  that  at  that  point  the  expected 
increase  in  the  number  of  stations  in  each  mode  per  unit  time  is  zero  [17].  In 
the  EPA  method,  we  assume  that  the  system  is  always  at  an  equilibrium  point. 
Applying  the  above  condition  to  all  the  modes,  we  get  a  set  of  equations  called 
equilibrium  point  equations,  whose  solution  gives  one  or  more  equilibrium  points. 

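Let $\bar{n} = (\bar{n}_T, \bar{n}_Q, \bar{i}_1, \ldots, \bar{i}_R, \bar{j}_1, \ldots, \bar{j}_R, \bar{k}_1, \ldots, \bar{k}_R)$ be an equilibrium point. Let $\delta_I(\bar{n})$ denote the conditional expectation of the increase in the number of stations in the I mode in a slot, given that the system is at $\bar{n}$. Setting $\delta_I(\bar{n}) = 0$, we then have

$$\bar{k}_1 \left(1 - \frac{\sigma}{p}\right) - \left[N - \bar{n}_T - \bar{n}_Q - \sum_{m=1}^{R} (\bar{i}_m + \bar{j}_m + \bar{k}_m)\right] \sigma = 0 \tag{6}$$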


Next, let $X(\bar{n})$ denote the conditional expectation of the number of stations that move out from the Q mode in a slot, given that the system is in state $\bar{n}$. Evaluating mode Q, we get

$$\delta_Q(\bar{n}) = \bar{j}_1 - X(\bar{n}) = 0 \tag{7}$$

Next we define two other terms. Let $f_T(\bar{n})$ denote the conditional expectation of the number of stations that successfully transmit reservation minipackets (thus moving from mode T to mode $PQ_R$) in a slot, given that the system is in state $\bar{n}$. Let $g_Q(i)$ denote the average number of stations that successfully transmit a data packet (therefore moving from mode Q to mode $PS_R$) in a slot, given that $i$ stations transmit (i.e., move out of mode Q). It can be derived (see the Appendix) that

$$f_T(\bar{n}) = \bar{n}_T\, p \left(1 - \frac{p}{V}\right)^{\bar{n}_T - 1}$$

and

$$g_Q(i) = N \left[1 - \left(1 - \frac{1}{N-1}\right)^{i-1} \left(1 - \frac{1}{N-1} + \frac{i}{N(N-1)}\right)\right]$$

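Evaluating the conditional expectation of increase for the $PR_m$, $PQ_m$, and $PS_m$ ($1 \le m \le R$) modes, we obtain the corresponding equations as follows.

$$\bar{i}_1 = \bar{i}_2 = \cdots = \bar{i}_R = \bar{n}_T\, p - f_T(\bar{n}) + X(\bar{n}) - g_Q(X(\bar{n})) \tag{8}$$

$$f_T(\bar{n}) = \bar{j}_1 = \cdots = \bar{j}_R \tag{9}$$

$$\bar{k}_1 = \cdots = \bar{k}_R = g_Q(X(\bar{n})) \tag{10}$$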

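We did not write down the equation for the T mode above since it is linearly dependent on the others. From the equations above, we get the following equations:

$$f_T(\bar{n}) - X(\bar{n}) = 0 \tag{11}$$

$$g_Q(X(\bar{n})) \left(1 - \frac{\sigma}{p}\right) - \left[N - \bar{n}_T - \bar{n}_Q - R\left(\bar{n}_T\, p + X(\bar{n})\right)\right] \sigma = 0 \tag{12}$$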

We model the queueing system in mode Q as a $W$-server system with a binomial input with mean $f_T(\bar{n})$ and a fixed one-slot service time for each customer. Define

$$\rho = \frac{f_T(\bar{n})}{W}$$

which is the utilization of the queueing system. The z-transform of the arrival process is

$$A(z) = \left(1 - \frac{\rho W}{V} + \frac{\rho W}{V} z\right)^{V}$$

This system was solved in the previous section, and the average number of customers in the system (see Eq. (5)) should be equal to $\bar{n}_Q$, the average number of stations in mode Q. Therefore, we have

$$\bar{n}_Q = -\sum_{i=1}^{V-W} \frac{1}{1 - z_i} \tag{13}$$

where $z_i$, $i = 1, \ldots, (V - W)$, are the roots of $A(z) - z^W = 0$ outside the unit circle $|z| = 1$. Also, since $X(\bar{n}) = f_T(\bar{n}) = \rho W$, equations (11) and (12) become

$$\bar{n}_T\, p \left(1 - \frac{p}{V}\right)^{\bar{n}_T - 1} - \rho W = 0 \tag{14}$$

$$\frac{g_Q(\rho W)}{\sigma} \left(1 - \frac{\sigma}{p}\right) = N - (1 + pR)\bar{n}_T - \bar{n}_Q - \rho R W \tag{15}$$

It is apparent that the equations (13), (14), and (15) can be solved for $\rho$. The system is said to be stable if only one solution exists. Otherwise, if there is more than one solution, the system is said to be unstable [17].
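To illustrate how more than one solution can arise, the sketch below looks only at the relation between the number of stations in mode T and the reservation throughput, $n_T\, p\,(1 - p/V)^{n_T - 1} = \rho W$, and finds the values of $n_T$ that satisfy it for a given $\rho$. It treats $n_T$ as a continuous variable and does not solve the full set of equilibrium point equations; the function names and the choice $\rho = 0.75$ are assumptions for illustration.

```python
from math import log

def f_T(n_T, p, V):
    """Expected successful reservations per slot with n_T stations in mode T."""
    return n_T * p * (1 - p / V) ** (n_T - 1)

def bisect(func, lo, hi, tol=1e-10):
    """Plain bisection; assumes func(lo) and func(hi) have opposite signs."""
    f_lo = func(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = func(mid)
        if (f_lo < 0) == (f_mid < 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def reservation_balance_roots(rho, W, p, V, N):
    """Real-valued n_T in (0, N] with f_T(n_T) = rho * W (zero, one, or two values)."""
    target = rho * W
    n_peak = -1.0 / log(1 - p / V)      # f_T is largest at this n_T
    residual = lambda n: f_T(n, p, V) - target
    roots = []
    if residual(n_peak) >= 0:
        roots.append(bisect(residual, 0.0, n_peak))
        if residual(N) < 0:
            roots.append(bisect(residual, n_peak, float(N)))
    return roots

N, W, V, p = 500, 4, 10, 0.2
low, high = reservation_balance_roots(0.75, W, p, V, N)
```

The smaller root corresponds to a lightly loaded reservation channel and the larger one to a congested channel; which of them belongs to an equilibrium point depends on the remaining equations.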

3.3.3  Throughput  and  Delay 

Let us now define the throughput $S(\bar{n})$ to be the conditional expectation of the number of correctly transmitted data packets in a slot, given that the system is in state $\bar{n}$. Then, it is clear that the throughput at an equilibrium point is expressed as

$$S(\bar{n}) = g_Q(X(\bar{n})) = g_Q(\rho W) \tag{16}$$

The average packet delay, which is the average time, in number of slots, from the moment the packet is generated until the instant the packet is correctly received by the destination, can be calculated from Little's result [14] to give

$$D = \frac{N}{S(\bar{n})} - \frac{1}{\sigma} \tag{17}$$


4  Numerical  Results


In  this  section  we  use  both  the  analytical  model  and  simulation  to  investigate  a 
specific  system  consisting  of  500  stations  interconnected  through  5  wavelengths 
(W=  4  in  this  case).  The  propagation  delay  R  is  equal  to  10  slots.  In  all  figures, 
we  note  the  excellent  agreement  between  the  analysis  and  simulation. 

Figure 8 shows the throughput per data wavelength versus delay curve for $V = 10$, $p = 0.2$, and $\sigma$ increasing from 0 to 0.2. Note that in the lower part of the curve the delay increases very slowly with the throughput; thus a high throughput per data channel and low delay can be achieved.

Figures 9 and 10 show the effect of varying $V$ and $W$ while keeping their sum constant by fixing the slot size. We can see that for different cases the maximum throughput always occurs at the point where $V \approx eW$. This is not surprising since the capacity of a slotted ALOHA channel is $1/e$. By making $V \approx eW$, the capacities of the $V$ ALOHA channels and the $W$ data channels are balanced. When $V < eW$, there is not enough throughput coming out of the ALOHA reservation channels to keep the $W$ data channels busy. When $V > eW$, most stations are waiting (in the Q mode) for a wavelength on which to transmit a data packet. When $V \approx eW$, the throughput is roughly equal to the load offered by the stations.
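The balance argument can be made concrete with a small capacity calculation. This is a rough sketch that uses only the capacity bound $\min(V/e, W)$ and ignores all finite-population effects; the function names are hypothetical.

```python
from math import e

def capacity(V, W):
    """Smaller of the reservation capacity (V/e) and the data capacity (W), in packets per slot."""
    return min(V / e, W)

def best_split(total):
    """Integer split V + W = total that maximizes the capacity bound."""
    return max(((V, total - V) for V in range(1, total)), key=lambda vw: capacity(*vw))

V, W = best_split(14)   # 14 is the V + W value mentioned in the text
```

For a total of 14 the best integer split under this bound has $V/W$ close to $e$, in line with the observation that the maximum throughput occurs near $V \approx eW$.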

Let $D_T$ and $D_Q$ denote the average time a station spends in modes T and Q, respectively, in a cycle. By Little's result, the throughput is equal to $N/(1/\sigma + D_T + R + D_Q + R)$. Setting both $D_T$ and $D_Q$ to 1, which is the minimum possible value, we obtain an upper bound for the throughput,

$$S_{UB} = \frac{N}{\frac{1}{\sigma} + 2(R+1)}$$

and a lower bound for the delay,

$$D_{LB} = \frac{N}{S_{UB}} - \frac{1}{\sigma} = 2(R + 1)$$

which are the "flat" regions in Figures 9 and 10, respectively. Note that when $S_{UB} > \min(V/e, W)$ (the case of $V + W = 14$), the flat region disappears since the throughput is not limited by the user-generated load.
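As a worked instance of these bounds, the sketch below evaluates them for the parameters of this section ($N = 500$, $\sigma = 0.01$, $R = 10$) and compares the throughput bound with $\min(V/e, W)$ for $V = 10$, $W = 4$. The function and variable names are illustrative.

```python
from math import e

def throughput_upper_bound(N, sigma, R):
    """S_UB = N / (1/sigma + 2(R + 1)), from Little's result with D_T = D_Q = 1."""
    return N / (1.0 / sigma + 2 * (R + 1))

def delay_lower_bound(R):
    """D_LB = 2(R + 1): one slot in T, one slot in Q, and two propagation delays."""
    return 2 * (R + 1)

N, sigma, R = 500, 0.01, 10
s_ub = throughput_upper_bound(N, sigma, R)      # 500 / 122 packets per slot
d_lb = delay_lower_bound(R)                     # 22 slots
flat_region_visible = s_ub <= min(10 / e, 4)    # V = 10, W = 4, i.e. V + W = 14
```

With these numbers the user-generated load bound exceeds the channel capacity bound, so the channels rather than the users limit the throughput for $V + W = 14$.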
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Throughput  por  data  channel 

Figure 8: Throughput per data channel versus delay. $N = 500$, $W = 4$, $V = 10$, $R = 10$, $p = 0.2$.

Figure 9: Throughput versus $V$ for fixed slot sizes ($V + W$ constant). $R = 10$, $\sigma = 0.01$, $p = 0.2$.

Figure 10: Delay versus $V$ for fixed slot sizes ($V + W$ constant). $\sigma = 0.01$, $p = 0.2$.

The effect of varying the propagation delay, $R$, is shown in Figures 11 and 12. Although the physical distance of the network is usually fixed, $R$ can be varied by varying the packet (and thus the slot) size. We see that when $R$ is small, too much traffic is offered to the $V$ ALOHA reservation channels and the throughput is small. As we increase $R$, the throughput increases too. It reaches the maximum at $R \approx 7$, then falls off because there is not enough traffic in each slot when $R$ becomes large. The tail of the throughput curve is bounded by the upper bound $S_{UB}$. The reason the increase is so sharp around $R = 6$ and 7 is that the system changes from an overloaded system (see Figure 7(d) in [17]) to a bistable one and then to a stable one as $R$ increases from 5 to 7. This phenomenon is observed in both the EPA analysis and the simulation. (For the case $R = 6$, the average value of the two solutions obtained from EPA is plotted.)

Figure 13 shows the influence of the destination conflicts. Suppose $i$ data packets are transmitted in a slot. The number of packets successfully received by their destinations is $g_Q(i)$. We plot the fraction of success, $g_Q(i)/i$, versus $i$ assuming $N = 500$. Near-term technology limits $W$, the number of transmitter/receiver-tunable wavelengths available, to be fewer than about twenty; thus we see that destination conflicts are not a serious concern.
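The fraction of success can be evaluated directly. The sketch below computes $g_Q(i)$ in two ways: from a closed form, and from the per-destination argument used in the Appendix (the destination is itself one of the senders with probability $i/N$, and no station sends to itself). The function names are hypothetical.

```python
def g_Q(i, N):
    """Expected number of data packets received without a destination conflict
    when i packets are sent among N stations (no station sends to itself)."""
    q = 1.0 - 1.0 / (N - 1)
    return N * (1.0 - q ** (i - 1) * (q + i / (N * (N - 1))))

def g_Q_per_destination(i, N):
    """Same quantity, built from the two cases for a destination station:
    it is one of the senders (probability i/N) or it is not (probability 1 - i/N)."""
    q = 1.0 - 1.0 / (N - 1)
    p_none = (i / N) * q ** (i - 1) + (1 - i / N) * q ** i
    return N * (1.0 - p_none)

N = 500
fractions = {i: g_Q(i, N) / i for i in (1, 4, 10, 20)}
```

For $N = 500$ the fraction stays close to 1 for up to twenty simultaneous packets, which is the basis for the remark that destination conflicts are not a serious concern.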

5  Conclusions 

In  this  paper  a  wavelength  division  multiple  access  protocol  (with  W  wavelength 
channels)  was  proposed  to  provide  a  high-capacity  optical  fiber  local  area  network 
to  a  large  population  of  N  users.  We  assumed  N  >W.  The  users’  traffic  was  as¬ 
sumed  to  be  bursty  as  in  the  case  of  computer  communications.  The  performance 
of  the  protocol  was  completely  analyzed  for  both  the  infinite  and  finite  popula¬ 
tion  cases.  The  numerical  results  show  that  an  aggregate  throughput  substantially 
larger  than  the  electronic  speed  of  a  single  station  can  be  supported.  The  effects 
of  various  system  parameters  and  their  optimal  selection  were  also  investigated. 
By comparison with simulation, our analytical approximations were shown to be excellent.

$N = 500$, $W = 4$, $V = 10$.

Figure 13: The fraction of success versus the number of data packets transmitted. $N = 500$.


Appendix: Derivations of $f_T(n)$ and $g_Q(i)$

Here we derive $f_T(n)$, the average number of stations that successfully transmit a reservation minipacket in a slot, given that the system is in state $n$. Suppose that $i$ reservation minipackets are transmitted in a slot. The probability that a particular one of them is successful is equal to $(1 - 1/V)^{i-1}$ and the average number of successful reservations is $i(1 - 1/V)^{i-1}$. Also, the conditional probability that $i$ reservation minipackets are transmitted in a slot given that the system is in state $n$ is

$$\binom{n_T}{i} p^i (1 - p)^{n_T - i}.$$

Therefore, we have

$$f_T(n) = \sum_{i=1}^{n_T} i \left(1 - \frac{1}{V}\right)^{i-1} \binom{n_T}{i} p^i (1 - p)^{n_T - i} = n_T\, p \left(1 - \frac{p}{V}\right)^{n_T - 1}.$$
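The closed form can be checked against the defining sum. This is a small numerical sketch assuming that each of the $n_T$ stations in mode T transmits independently with probability $p$ and chooses one of the $V$ reservation channels uniformly at random; the function names are hypothetical.

```python
from math import comb

def f_T_sum(n_T, p, V):
    """Average successes: Binomial(n_T, p) weight times i (1 - 1/V)^(i - 1), summed over i."""
    return sum(
        comb(n_T, i) * p**i * (1 - p) ** (n_T - i) * i * (1 - 1 / V) ** (i - 1)
        for i in range(1, n_T + 1)
    )

def f_T_closed(n_T, p, V):
    """Closed form n_T p (1 - p/V)^(n_T - 1)."""
    return n_T * p * (1 - p / V) ** (n_T - 1)

p, V = 0.2, 10
gaps = [abs(f_T_sum(n, p, V) - f_T_closed(n, p, V)) for n in (1, 5, 50, 200)]
```

The agreement follows from the binomial theorem after factoring $n_T\, p$ out of the sum.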

Now we derive $g_Q(i)$, the average number of successfully transmitted data packets given that $i$ data packets are transmitted, under the assumption that a station does not transmit to itself. Consider a destination station, say station $k$, for example. The probability that station $k$ is the source of one of the $i$ transmitted packets is $i/N$, and given this, the probability that none of the other $(i - 1)$ packets are destined to station $k$ is $\left(1 - \frac{1}{N-1}\right)^{i-1}$. The probability that none of the $i$ packets is transmitted by station $k$ is $(1 - i/N)$, and given this, the probability that none of the $i$ packets are destined to station $k$ is $\left(1 - \frac{1}{N-1}\right)^{i}$. Therefore, the probability that at least one among those $i$ packets is going to station $k$ (which is also equal to the average number of packets successfully received by station $k$) is equal to

$$1 - \left[\frac{i}{N}\left(1 - \frac{1}{N-1}\right)^{i-1} + \left(1 - \frac{i}{N}\right)\left(1 - \frac{1}{N-1}\right)^{i}\right].$$

Therefore, we have

$$g_Q(i) = N\left[1 - \left(1 - \frac{1}{N-1}\right)^{i-1}\left(1 - \frac{1}{N-1} + \frac{i}{N(N-1)}\right)\right].$$